Data Analysis

Name: Data Analysis
Author: supercent-io

The data-analysis skill gives step-by-step instructions for exploring, cleaning, analyzing, and visualizing datasets in Python (pandas) or SQL. It covers descriptive and grouped statistics and visualizations such as histograms, boxplots, and heatmaps. It is aimed at data analysts, data scientists, and anyone who needs to understand and report on a dataset.

npx skills add https://github.com/supercent-io/skills-template --skill data-analysis


When to use this skill

Data exploration: Understand a new dataset
Report generation: Derive data-driven insights
Quality validation: Check data consistency
Decision support: Make data-driven recommendations

Instructions

Step 1: Load and explore data

Python (Pandas):

import pandas as pd
import numpy as np
# Load CSV
df = pd.read_csv('data.csv')
# Basic info (df.info() prints its own summary and returns None)
df.info()
print(df.describe())
print(df.head(10))
# Check missing values
print(df.isnull().sum())
# Data types
print(df.dtypes)
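The missing-value check above can be extended into a small per-column summary. This is a minimal sketch: the helper name `missing_summary` and the sample frame are hypothetical and stand in for whatever `data.csv` actually contains.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Made-up sample data (assumption, not from the original dataset)
sample = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "value": [10.0, None, 30.0, None],
    "category": ["a", "b", None, "b"],
})
report = missing_summary(sample)
print(report)
```

Sorting by the missing count puts the columns that need the most attention at the top, which helps decide between filling and dropping in Step 2.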

SQL:

-- Inspect table schema (DESCRIBE is MySQL syntax; other engines differ, e.g. \d table_name in psql)
DESCRIBE table_name;
-- Sample data
SELECT * FROM table_name LIMIT 10;
-- Basic stats
SELECT
    COUNT(*) as total_rows,
    COUNT(DISTINCT column_name) as unique_values,
    MIN(numeric_column) as min_val,
    MAX(numeric_column) as max_val,
    AVG(numeric_column) as avg_val
FROM table_name;

Step 2: Data cleaning

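# Handle missing values (assign the result back; inplace=True on a selected
# column is deprecated and does not update df under pandas copy-on-write)
df['column'] = df['column'].fillna(df['column'].mean())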
df.dropna(subset=['required_column'], inplace=True)
# Remove duplicates
df.drop_duplicates(inplace=True)
# Type conversions
df['date'] = pd.to_datetime(df['date'])
df['category'] = df['category'].astype('category')
# Remove outliers (IQR method)
Q1 = df['value'].quantile(0.25)
Q3 = df['value'].quantile(0.75)
IQR = Q3 - Q1
df = df[(df['value'] >= Q1 - 1.5*IQR) & (df['value'] <= Q3 + 1.5*IQR)]
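As a worked instance of the IQR method, the sketch below computes the bounds on a small made-up series and flags outliers on a copy before filtering. The helper name `iqr_bounds`, the `is_outlier` column, and the sample values are assumptions for illustration.

```python
import pandas as pd

def iqr_bounds(series: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) bounds at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up sample data (assumption)
raw = pd.DataFrame({"value": [10, 12, 11, 13, 12, 95]})

# Work on a copy so the raw frame stays untouched
clean = raw.copy()
low, high = iqr_bounds(clean["value"])

# Flag first, then filter, so the flagged rows can be reviewed before removal
clean["is_outlier"] = ~clean["value"].between(low, high)
filtered = clean[~clean["is_outlier"]]

print(low, high)
print(filtered)
```

Flagging before dropping keeps a record of which rows were excluded, which makes the cleaning step easier to document and to reverse.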

Step 3: Statistical analysis

# Descriptive statistics
print(df['numeric_column'].describe())
# Grouped analysis
grouped = df.groupby('category').agg({
    'value': ['mean', 'sum', 'count'],
    'other': 'nunique'
})
print(grouped)
# Correlation
correlation = df[['col1', 'col2', 'col3']].corr()
print(correlation)
# Pivot table
pivot = pd.pivot_table(df,
    values='sales',
    index='region',
    columns='month',
    aggfunc='sum'
)
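The grouped analysis above returns two-level column labels when a column gets a list of functions. One alternative, sketched here on made-up data, is pandas named aggregation, which yields flat column names. The output names (`value_mean` and so on) are hypothetical.

```python
import pandas as pd

# Made-up sample data (assumption)
df = pd.DataFrame({
    "category": ["a", "a", "b", "b", "b"],
    "value": [10, 20, 5, 15, 25],
    "other": ["x", "y", "x", "x", "z"],
})

# Named aggregation: output_name=(source_column, function)
grouped = df.groupby("category").agg(
    value_mean=("value", "mean"),
    value_sum=("value", "sum"),
    value_count=("value", "count"),
    other_unique=("other", "nunique"),
)
print(grouped)
```

Flat column names are easier to reference in later steps and export cleanly to CSV or a report table.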

Step 4: Visualization

import matplotlib.pyplot as plt
import seaborn as sns
# Histogram
plt.figure(figsize=(10, 6))
df['value'].hist(bins=30)
plt.title('Distribution of Values')
plt.savefig('histogram.png')
plt.close()  # release the figure once it is saved
# Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='category', y='value', data=df)
plt.title('Value by Category')
plt.savefig('boxplot.png')
plt.close()
# Heatmap (correlation matrix computed in Step 3)
plt.figure(figsize=(10, 8))
sns.heatmap(correlation, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.savefig('heatmap.png')
plt.close()
# Time series (total value per date)
plt.figure(figsize=(12, 6))
df.groupby('date')['value'].sum().plot()
plt.title('Time Series of Values')
plt.savefig('timeseries.png')
plt.close()

Step 5: Derive insights

# Top/bottom analysis
top_10 = df.nlargest(10, 'value')
bottom_10 = df.nsmallest(10, 'value')
# Trend analysis
df['month'] = df['date'].dt.to_period('M')
monthly_trend = df.groupby('month')['value'].sum()
growth = monthly_trend.pct_change() * 100
# Segment analysis
segments = df.groupby('segment').agg({
    'revenue': 'sum',
    'customers': 'nunique',
    'orders': 'count'
})
segments['avg_order_value'] = segments['revenue'] / segments['orders']
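To make the trend calculation concrete, here is the same monthly aggregation and growth computation on a few made-up rows. The dates and values are assumptions chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Made-up sample data (assumption)
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2024-01-05", "2024-01-20", "2024-02-10", "2024-03-15"]
    ),
    "value": [100, 100, 250, 200],
})

# Bucket each row into its calendar month, then total per month
df["month"] = df["date"].dt.to_period("M")
monthly_trend = df.groupby("month")["value"].sum()

# Month-over-month change in percent; the first month has no
# predecessor, so its growth is NaN
growth = monthly_trend.pct_change() * 100

print(monthly_trend)
print(growth.round(1))
```

Here the monthly totals are 200, 250, and 200, so growth is undefined for January, about +25% for February, and about -20% for March.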

Output format

Analysis report structure

# Data Analysis Report
## 1. Dataset overview
- Dataset: [name]
- Records: X,XXX
- Columns: XX
- Date range: YYYY-MM-DD to YYYY-MM-DD
## 2. Key findings
- Insight 1
- Insight 2
- Insight 3
## 3. Statistical summary
| Metric | Value |
|------|-----|
| Mean | X.XX |
| Median | X.XX |
| Std dev | X.XX |
## 4. Recommendations
1. [Recommendation 1]
2. [Recommendation 2]
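The statistical summary table in the report can be generated rather than typed by hand. This is a minimal sketch under the assumption that the summary covers a single numeric column; the helper name `summary_table` and the sample values are hypothetical.

```python
import pandas as pd

def summary_table(series: pd.Series) -> str:
    """Render mean, median, and sample std dev as a Markdown table."""
    rows = [
        ("Mean", series.mean()),
        ("Median", series.median()),
        ("Std dev", series.std()),  # pandas default is the sample std (ddof=1)
    ]
    lines = ["| Metric | Value |", "|------|-----|"]
    lines += [f"| {name} | {value:.2f} |" for name, value in rows]
    return "\n".join(lines)

# Made-up sample data (assumption)
values = pd.Series([2.0, 4.0, 4.0, 6.0])
table = summary_table(values)
print(table)
```

Generating the table from the data keeps the report in sync with the analysis when the dataset is refreshed.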

Best practices

Understand the data first: Learn structure and meaning before analysis
Incremental analysis: Move from simple to complex analyses
Use visualization: Use a variety of charts to spot patterns
Validate assumptions: Always verify assumptions about the data
Reproducibility: Document analysis code and results

Constraints

Required rules (MUST)

Preserve raw data (work on a copy)
Document the analysis process
Validate results
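One way to follow these rules in pandas is to clean a copy and assert a few invariants afterwards. The checks below are a sketch with made-up data and hypothetical variable names; real validation depends on the dataset.

```python
import pandas as pd

# Made-up raw data (assumption): one duplicate row, one missing value
raw = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "value": [10.0, 20.0, 20.0, None],
})

# Preserve raw data: all cleaning happens on a copy
work = raw.copy()
work = work.drop_duplicates().dropna(subset=["value"])

# Validate results: fail loudly if an expectation does not hold
assert len(raw) == 4, "raw data must not be modified"
assert work["value"].notna().all(), "required column still has gaps"
assert not work.duplicated().any(), "duplicates remain after cleaning"
assert len(work) <= len(raw), "cleaning should never add rows"

print(f"kept {len(work)} of {len(raw)} rows")
```

Recording the before and after row counts in the report also documents how much data the cleaning step removed.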

Prohibited (MUST NOT)

Do not expose sensitive personal data
Do not draw unsupported conclusions

GitHub Owner

Owner: supercent-io
