HxHippy

AI for Data Analysis & Processing

Use AI to analyze datasets, generate reports, and extract insights from data.

Last updated: 2024-12-18

Leverage AI for faster, more insightful data analysis without replacing your data engineering skills.

Data Analysis Workflow

Raw Data → Understanding → Cleaning → Analysis → Visualization → Insights
    ↑           ↑            ↑          ↑            ↑            ↑
   AI          AI           AI         AI           AI           AI

Phase 1: Data Understanding

Quick Data Profile

Here's a sample of my dataset (first 50 rows as CSV):

[paste data]

Provide:
1. Column analysis (types, distributions, nulls)
2. Potential data quality issues
3. Relationships between columns
4. Suggested cleaning steps
5. Analysis questions this data could answer

Schema Design

I have these raw log files:

[sample log entries]

Design a structured schema for storing this data in PostgreSQL.
Include:
- Table definitions with appropriate types
- Indexes for common queries
- Partitioning strategy if needed
- Sample insert/query statements

Phase 2: Data Cleaning

Generate Cleaning Scripts

Write a Python data cleaning script for this dataset:

Issues to fix:
- Column "date" has mixed formats (MM/DD/YYYY and YYYY-MM-DD)
- "price" column has currency symbols and commas
- "category" has inconsistent capitalization
- Several rows have missing "email" values
- Duplicate entries based on (user_id, timestamp)

Use pandas. Include validation checks.

SQL Data Cleaning

Write SQL queries to clean this orders table:

Problems:
- Some order_total values are negative
- customer_email has entries like "N/A", "none", empty strings
- shipped_date is sometimes before order_date
- duplicate orders within 1 minute from same customer

For each issue:
1. Query to identify affected rows
2. Query to fix or flag them
3. Validation query to confirm the fix

Phase 3: Analysis

Exploratory Analysis

Given this sales data schema:

orders(id, customer_id, product_id, quantity, total, created_at)
customers(id, segment, region, acquisition_date)
products(id, category, price, cost)

Write SQL queries for:
1. Monthly revenue trend (last 12 months)
2. Top 10 customers by lifetime value
3. Product category performance (revenue, margin, growth)
4. Customer cohort retention analysis
5. Seasonal patterns in purchasing

Include explanations of what each query reveals.

Statistical Analysis

Perform statistical analysis on this A/B test data:

Control group: n=5000, conversions=150
Treatment group: n=5000, conversions=185

Calculate:
1. Conversion rates for each group
2. Statistical significance (p-value)
3. Confidence interval for the difference
4. Required sample size for 80% power
5. Recommendation on whether to implement

Show the calculations and Python code.

Phase 4: Visualization

Chart Recommendations

I have this data about website traffic:

- Daily unique visitors (365 days)
- Traffic source breakdown (organic, paid, social, direct)
- Page views by section
- Bounce rate by landing page
- Conversion funnel stages

What visualizations should I create for:
1. Executive dashboard
2. Marketing team report
3. Technical performance review

Include chart types, key metrics to highlight, and sample Python/JS code.

Generate Visualization Code

Create a Python visualization for this data:

Data: Monthly sales by region (4 regions, 24 months)

Requirements:
- Line chart with all regions
- Highlight the region with highest growth
- Add trend lines
- Professional styling (not default matplotlib)
- Export as PNG and SVG

Use matplotlib or plotly.

Phase 5: Reporting

Automated Reports

Generate a weekly metrics report template:

Metrics to include:
- Revenue: total, vs last week, vs same week last year
- Orders: count, average value, by channel
- Customers: new, returning, churn indicators
- Top products: by revenue, by units, by growth
- Issues: failed payments, returns, support tickets

Format as:
1. Executive summary (3 bullet points)
2. Key metrics table
3. Detailed breakdowns
4. Action items

Include Python/SQL code to pull this data.

Insight Generation

Here's our Q3 sales data summary:

[paste summary statistics]

Generate insights for a board presentation:
1. What are the 3 most important trends?
2. What's surprising or concerning?
3. What actions should leadership consider?
4. What additional data would help answer open questions?

Common Data Tasks

SQL Query Generation

Natural language: "Find customers who bought product A but not product B
in the last 30 days, grouped by their signup month"

Schema:
customers(id, email, created_at)
orders(id, customer_id, product_id, created_at)
products(id, name, category)

Generate the SQL query with explanation.

Data Transformation

Transform this JSON API response into a flat CSV:

```json
{
  "users": [{
    "id": 1,
    "profile": {"name": "Alice", "tags": ["admin", "beta"]},
    "orders": [{"id": 100, "total": 50}, {"id": 101, "total": 75}]
  }]
}

Output format: user_id, user_name, tag, order_id, order_total (one row per user-tag-order combination)

Write Python code using pandas.


### Regular Expressions for Data

```text
Write regex patterns to extract:

1. Email addresses from text
2. Phone numbers (US format)
3. URLs (http/https)
4. IP addresses (v4)
5. Dates in various formats

Include Python code with test cases.

Best Practices

  1. Validate AI output: Always verify generated queries on a subset
  2. Preserve original data: Never modify source data directly
  3. Document transformations: Keep a log of all cleaning steps
  4. Test edge cases: AI may not handle all data anomalies
  5. Version control: Track analysis scripts like code
intermediate Workflows Updated 2024-12-18
  • ai data analysis
  • data processing
  • ai analytics
  • data insights
  • ai reports