Data Cleaning  |  Missing Data  |  Outliers  |  Sample Size  |  Data Sources

PART 1:   DATA CLEANING

Why Your Data is Giving Wrong Results: Common Data Cleaning Mistakes (and How to Fix Them)

You ran the numbers. The chart looks clean. But something feels off, and it probably started long before the analysis.

There is a particular kind of frustration that researchers know well. You have spent weeks collecting data, you have set up your analysis neatly, and then the results come back looking strange. Relationships that should not exist. Averages that do not match reality. Significance where there probably is not any.

Most of the time, the culprit is not a fancy statistical error. It is a basic data cleaning mistake, one that slipped through early and quietly corrupted everything downstream. Let us talk about the most common ones and, more importantly, how to catch them before they ruin your work.

Mistake #1: Treating all blanks as ‘no response’

A blank cell in your spreadsheet is not always a missing response. Sometimes it means zero. Sometimes the question was not applicable. Sometimes data simply was not entered. Lumping all of these together (or worse, deleting them) distorts your distributions in ways that are hard to trace later.

FIX
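Before touching blanks, document why each type exists. Use distinct codes (say, -99 for ‘not applicable’ and -98 for ‘data unavailable’) so you can filter or handle each type appropriately. Then declare those codes as missing values in your software, so that a -99 is never averaged in as if it were a real number.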
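
As a rough illustration, here is a minimal sketch of that coding scheme in plain Python. The column (number of children, asked only of parents), the `applicable` flag, and the function names are all hypothetical; the point is only that each blank gets a documented code and that the codes are kept out of any calculation.

```python
# Hypothetical codes, following the -99 / -98 scheme suggested above.
NOT_APPLICABLE = -99
DATA_UNAVAILABLE = -98
MISSING_CODES = {NOT_APPLICABLE, DATA_UNAVAILABLE}

def recode_blank(value, applicable):
    """Return a numeric value, or a code explaining why the cell is blank."""
    if value is None or str(value).strip() == "":
        return DATA_UNAVAILABLE if applicable else NOT_APPLICABLE
    return float(value)

def mean_excluding_codes(values):
    """Average only real observations; the missing codes never enter the sum."""
    real = [v for v in values if v not in MISSING_CODES]
    return sum(real) / len(real) if real else None

# Hypothetical column: (raw cell, was the question applicable to this respondent?)
raw = [("2", True), ("", True), ("", False), ("0", True)]
coded = [recode_blank(value, applicable) for value, applicable in raw]
average = mean_excluding_codes(coded)
```

Note that the genuine zero in the last row survives as a real observation, while the two blanks end up with different codes.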

Mistake #2: Inconsistent formatting treated as different values

‘Male’, ‘male’, ‘M’, ‘MALE’: to a human, these mean exactly the same thing. To your software, they are four separate categories. This is one of the most common data entry disasters in any study that involves manually entered text fields.

The fix is simple in theory but tedious in practice: standardise everything. Pick a format and enforce it consistently across the whole dataset. Most tools like R, Python, or SPSS let you recode these values quickly with a single command once you know what you are looking for.
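
A minimal sketch of what that recoding might look like in plain Python. The mapping and the function name are assumptions for illustration; you would build the mapping from the variants actually present in your data.

```python
# Hypothetical mapping; extend it with the variants found in your own data.
GENDER_MAP = {"male": "Male", "m": "Male", "female": "Female", "f": "Female"}

def standardise_gender(raw):
    """Trim whitespace, ignore case, and map known variants to one label."""
    key = raw.strip().lower()
    # Unknown values are returned unchanged so they can be reviewed, not guessed.
    return GENDER_MAP.get(key, raw)

entries = ["Male", "male", "M", "MALE", " female ", "F", "Other"]
cleaned = [standardise_gender(entry) for entry in entries]
```

Leaving unrecognised values untouched is a deliberate choice here: it keeps anything unexpected visible for a manual check instead of silently forcing it into a category.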

Mistake #3: Duplicate rows going unnoticed

If a survey respondent submitted their form twice, or a data export ran twice and got merged, you now have duplicated observations. Your sample size looks bigger than it is, and your standard errors shrink artificially. The results become misleadingly precise.

FIX
Always run a duplicate check before analysis. Use unique identifiers like respondent IDs or timestamps, and flag any rows where all key fields match exactly.
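
One way such a check could look, sketched in plain Python. The field names (`respondent_id`, `submitted`) and the sample rows are made up; the function flags any row whose key fields repeat an earlier row, leaving the decision about what to do with it to you.

```python
def flag_duplicates(rows, key_fields):
    """Return indexes of rows whose key fields repeat an earlier row."""
    seen = set()
    duplicates = []
    for index, row in enumerate(rows):
        key = tuple(row[field] for field in key_fields)
        if key in seen:
            duplicates.append(index)
        else:
            seen.add(key)
    return duplicates

# Made-up survey rows; the third repeats the first on both key fields.
rows = [
    {"respondent_id": "R001", "submitted": "2024-03-01 10:15", "score": 4},
    {"respondent_id": "R002", "submitted": "2024-03-01 10:20", "score": 5},
    {"respondent_id": "R001", "submitted": "2024-03-01 10:15", "score": 4},
]
dupes = flag_duplicates(rows, ["respondent_id", "submitted"])
```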

Mistake #4: Ignoring data types

Importing a dataset from Excel often sneaks in dates stored as serial numbers, IDs stored as integers that accidentally get averaged, and percentages stored as text strings. Every one of these can cause your software to run the wrong calculation without warning you.
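
To make the three cases concrete, here is a small sketch of type fixes in plain Python. The function names are hypothetical, and the date conversion assumes the workbook uses Excel's default 1900 date system; check your own file before relying on it.

```python
from datetime import date, timedelta

def excel_serial_to_date(serial):
    """Convert an Excel serial number to a date.

    Assumes the default 1900 date system and a date after February 1900.
    """
    return date(1899, 12, 30) + timedelta(days=int(serial))

def percent_text_to_fraction(text):
    """Turn a string such as '45%' into the fraction 0.45."""
    return float(text.strip().rstrip("%")) / 100

def id_as_text(value):
    """Store identifiers as strings so they cannot be summed or averaged."""
    return str(value)

converted_date = excel_serial_to_date(45000)
converted_share = percent_text_to_fraction(" 45% ")
converted_id = id_as_text(10482)
```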

Mistake #5: Cleaning without a log

Perhaps the most dangerous mistake is cleaning your data without recording what you changed and why. Three months later, when your supervisor asks why certain rows are missing, you will not remember. And you will not be able to reproduce your own analysis.

Create a cleaning log, a simple text file or notebook where every transformation is documented: what you removed, what you recoded, and the reason for each decision. This is not extra work. It is what separates reproducible research from a one-time guess.
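
If you clean your data in a script, the log can be written by the script itself. The sketch below is one possible layout, not a standard; the field names and the example entries are made up.

```python
import csv
from datetime import datetime

LOG_FIELDS = ["timestamp", "action", "rows_affected", "reason"]

def log_step(log, action, rows_affected, reason):
    """Append one cleaning decision to an in-memory log."""
    log.append({
        "timestamp": datetime.now().isoformat(timespec="seconds"),
        "action": action,
        "rows_affected": rows_affected,
        "reason": reason,
    })

def save_log(log, path):
    """Write the log to a CSV file that can sit alongside the dataset."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=LOG_FIELDS)
        writer.writeheader()
        writer.writerows(log)

cleaning_log = []
log_step(cleaning_log, "removed duplicate rows", 3, "same respondent ID and timestamp")
log_step(cleaning_log, "recoded gender variants", 41, "standardised to Male/Female")
```

Calling `save_log(cleaning_log, "cleaning_log.csv")` at the end of the script would then leave a dated record of every decision next to the cleaned file.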

Data cleaning feels like the boring part. But it is, without question, the most consequential part of any analysis. Catch the problems here, and everything after becomes dramatically more trustworthy.

PART 2:   MISSING DATA

Missing Data in Research? 5 Proven Methods to Handle It Correctly

Deleting missing rows feels like a clean solution. It rarely is.

Every researcher hits this moment: you open your dataset and find a column with gaps scattered through it like potholes. The temptation is immediate: just delete those rows and move on. But missing data has patterns, and those patterns matter. How you handle the gaps directly determines the reliability of your conclusions.

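Before you choose a method, there is one question you must answer: why is the data missing? The statistical world breaks this into three categories: Missing Completely At Random (MCAR), where missingness is unrelated to any of the data; Missing At Random (MAR), where it depends only on values you did observe; and Missing Not At Random (MNAR), where it depends on the missing values themselves. Each requires a different approach.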

#  |  Method  |  Description
01  |  Listwise Deletion  |  Remove any row with missing values. Only safe when data is MCAR and missingness is very low (under 5%).
02  |  Mean / Median Imputation  |  Replace missing values with the column average. Quick but underestimates variance and ignores relationships between variables.
03  |  Regression Imputation  |  Predict missing values using other variables in your dataset. Better than mean imputation but assumes a linear relationship.
04  |  Multiple Imputation (MI)  |  Creates several complete datasets with different imputed values, runs analyses on each, then pools results. The gold standard for MAR data.
05  |  Maximum Likelihood  |  Uses all available data to estimate parameters without actually filling in missing values. Ideal for structural equation models.
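
The variance problem with mean imputation is easy to see on a toy example. The sketch below uses made-up scores and plain Python; it fills the gaps with the observed mean and then compares the sample variance before and after.

```python
import statistics

def mean_impute(values):
    """Replace None with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

# Made-up scores with three gaps, for illustration only.
scores = [52, 61, None, 70, 48, None, 66, 59, None, 75]
observed = [v for v in scores if v is not None]
imputed = mean_impute(scores)

var_observed = statistics.variance(observed)
var_imputed = statistics.variance(imputed)
```

The mean is unchanged, but `var_imputed` comes out smaller than `var_observed`: the filled-in values add to the sample size without adding any spread, which is exactly why the method makes results look more precise than they are.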

The one method most thesis writers use (and should not)

Listwise deletion is the default in almost every statistical software package. It is also dangerously overused. If your missing data is not truly random — if, say, lower-income respondents skipped a question about salary — then deleting those rows systematically removes a segment of your population. Your remaining sample is no longer representative. Your conclusions quietly become wrong.

What to actually do

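Start by checking your missing data pattern. Run Little’s MCAR test in SPSS or R: a significant result suggests the gaps are not completely random. Then visualize the pattern. Are the same respondents missing across multiple variables? That is a red flag that the missingness is systematic (MAR or MNAR) rather than MCAR. Keep in mind that no test on the observed data can separate MAR from MNAR; that judgment rests on what you know about how the data was collected.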
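
A first look at the pattern does not need special software. The sketch below, in plain Python with made-up rows and hypothetical function names, reports the share of missing values per column and lists the respondents who are missing on several variables at once. It does not replace Little’s test; it only shows where to look.

```python
def missing_share_by_column(rows, columns):
    """Fraction of rows with a missing (None) value in each column."""
    return {
        col: sum(row[col] is None for row in rows) / len(rows)
        for col in columns
    }

def rows_missing_several(rows, columns, threshold=2):
    """Indexes of rows missing at least `threshold` of the given columns."""
    return [
        i for i, row in enumerate(rows)
        if sum(row[col] is None for col in columns) >= threshold
    ]

# Made-up survey rows, for illustration only.
rows = [
    {"age": 34, "salary": 52000, "tenure": 5},
    {"age": 41, "salary": None, "tenure": None},
    {"age": None, "salary": None, "tenure": 2},
    {"age": 29, "salary": 38000, "tenure": 1},
]
columns = ["age", "salary", "tenure"]
shares = missing_share_by_column(rows, columns)
repeat_missing = rows_missing_several(rows, columns)
```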

For most academic research with moderate missingness (5-20%), Multiple Imputation is your best friend. It is available in SPSS, R’s mice package, and Stata. Yes, it takes a bit more setup. But it respects the uncertainty caused by missing data instead of pretending it does not exist.
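
The pooling step is where Multiple Imputation accounts for that uncertainty. Packages do this for you, but a minimal sketch of the standard combining rules (Rubin’s rules) for a single parameter may help show what is happening. The estimates and variances below are made up, and the function name is hypothetical.

```python
import statistics

def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets using Rubin's rules.

    estimates: the parameter estimate from each imputed dataset
    variances: the squared standard error from each imputed dataset
    """
    m = len(estimates)
    pooled_estimate = statistics.mean(estimates)
    within = statistics.mean(variances)        # average sampling variance
    between = statistics.variance(estimates)   # spread across imputations
    total = within + (1 + 1 / m) * between
    return pooled_estimate, total ** 0.5       # estimate and pooled standard error

# Made-up results from five imputed datasets, for illustration only.
estimates = [2.10, 2.25, 1.95, 2.30, 2.15]
variances = [0.040, 0.045, 0.038, 0.042, 0.041]
pooled, pooled_se = pool_rubin(estimates, variances)
```

The pooled standard error is larger than the average standard error from any single imputed dataset, because the disagreement between imputations is added on top. That extra term is the honest price of the missing values.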

KEY RULE
Always report how much data was missing, where it was missing, and which method you used to handle it. Reviewers and examiners will ask. You want a clear, defensible answer ready.

Missing data is not a flaw to hide. Handled correctly, it is just another part of rigorous research.

PART 3:   OUTLIERS

Outliers in Your Dataset: Should You Remove or Keep Them?

An outlier is either a mistake, a signal, or the most interesting observation in your entire study. You need to know which one it is.

There is no question in data analysis that produces more heated disagreement than this one. Some researchers remove outliers as a matter of routine. Others consider it intellectual dishonesty. Both camps have a point — because the right answer genuinely depends on what the outlier actually is.

First: define what you mean by ‘outlier’

An outlier is any observation that sits far from the rest of your data. But ‘far’ needs a mathematical definition. Common methods include the IQR rule (values more than 1.5 times the interquartile range below the first quartile or above the third), Z-scores (values more than 3 standard deviations from the mean), and visual inspection via box plots or scatter plots.
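
Both numerical rules fit in a few lines. The sketch below uses Python’s standard library and made-up ages; the function names are hypothetical, and quartiles can differ slightly between software packages depending on the interpolation method, so borderline points may be flagged by one tool and not another.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Values more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def z_score_outliers(values, cutoff=3.0):
    """Values more than `cutoff` sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) / sd > cutoff]

# Made-up ages (22, 24, ..., 60) plus one suspicious entry.
ages = list(range(22, 62, 2)) + [550]
flagged_by_iqr = iqr_outliers(ages)
flagged_by_z = z_score_outliers(ages)
```

Here both rules flag only the value 550. Neither rule says what to do with it; that depends on the answer to the question that follows.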

The method matters less than the question you ask after identifying the point: why is this value extreme?

Three types of outliers, three different responses

Data entry errors. Someone typed 550 instead of 55. A survey collected ages, and one respondent entered 7 when they meant 70. These are not real observations; they are mistakes. Correct them if the true value can be verified; otherwise remove them. Either way, document what you did.

Legitimate extreme values. A CEO’s salary in a study of employee incomes is genuinely extreme, but it is real. A patient who recovered exceptionally fast is a real outlier. Removing these because they make your analysis cleaner is a form of bias; you are shaping your data to fit your expectations. Keep them, and if they heavily influence your results, run the analysis with and without them and report both.

Outliers from the wrong population. If your study targets adults and one respondent is clearly a child who slipped through, that observation is not an outlier; it is the wrong subject. Exclusion is appropriate, and you should describe the exclusion criteria transparently.

THE GUIDING PRINCIPLE
Remove an outlier only when you have a documented, pre-specified reason. Removing it because it weakens your results is p-hacking. Removing it because it is a data entry error is data cleaning. The motivation is everything.

What to do instead of removing

Before reaching for the delete key, consider transforming your data. A log transformation (which requires strictly positive values) can pull extreme values closer to the center of your distribution without excluding any observations. Alternatively, rank-based non-parametric tests like the Mann-Whitney or Kruskal-Wallis are naturally more resistant to outlier influence than their parametric counterparts.
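
As a small illustration of the log transformation, the sketch below uses made-up incomes (in thousands) and plain Python. The "gap" measure, how far the largest value sits from the median relative to the median, is just a convenient yardstick chosen for this example, not a standard statistic.

```python
import math
import statistics

def log_transform(values):
    """Natural-log transform; only defined for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("log transform needs strictly positive values")
    return [math.log(v) for v in values]

def relative_gap(values):
    """Distance from the largest value to the median, in units of the median."""
    middle = statistics.median(values)
    return (max(values) - middle) / middle

# Made-up incomes in thousands, including one executive salary.
incomes = [38, 42, 45, 47, 51, 55, 58, 900]
logged = log_transform(incomes)
raw_gap = relative_gap(incomes)
log_gap = relative_gap(logged)
```

Every observation is still in the dataset after the transformation; the extreme salary simply sits much closer to the rest.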

You can also run a sensitivity analysis — report your main findings, then report what happens if the outlier is excluded. If the conclusions hold either way, the outlier does not change your story. If they do not, the outlier IS the story, and you need to investigate it, not bury it.
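
In its simplest form, a sensitivity analysis is the same summary computed twice. The sketch below uses made-up incomes and a hypothetical function name; in a real study you would rerun your actual model, not just the mean and median.

```python
import statistics

def sensitivity_summary(values, suspected):
    """Mean and median with and without the suspected outliers."""
    kept = [v for v in values if v not in suspected]
    return {
        "with_outliers": {
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        },
        "without_outliers": {
            "mean": statistics.mean(kept),
            "median": statistics.median(kept),
        },
    }

# Made-up incomes in thousands, including one executive salary.
incomes = [38, 42, 45, 47, 51, 55, 58, 900]
summary = sensitivity_summary(incomes, suspected={900})
```

In this toy example the mean moves from 154.5 to 48 when the one salary is excluded, while the median barely shifts (49 to 47). A result that depends on the mean would need both versions reported.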

BOTTOM LINE
Never remove an outlier silently. Every decision about unusual data should appear in your methodology section with justification. Transparency here is not optional — it is what makes your findings credible.

PART 4:   SAMPLE SIZE

Sample Size Confusion in Thesis? A Practical Guide to Getting It Right

Too small and your results are meaningless. Too large and you are detecting effects that do not matter. Sample size is a decision, not a guess.

Ask ten thesis students how they decided their sample size, and at least seven will admit, quietly, that they used whatever they could collect in the time available. It is understandable. But it is also how studies end up underpowered, or how viva examiners ask the question every researcher dreads: ‘How did you determine your sample size?’

You need a real answer. Here is how to get one.

Start with power analysis

Power analysis is the formal process of calculating how many participants you need to detect an effect of a given size with a given level of confidence. It requires you to specify four things before you collect a single data point:

Alpha (α): your significance threshold, almost always 0.05.

Power (1-β): your probability of detecting a real effect, conventionally set at 0.80.

Effect size: how large an effect you expect to find, based on theory or prior research.

Test type: the specific statistical test you will be running.

Free tools like G*Power make this calculation straightforward. Plug in the numbers, and it tells you how many participants you need. That number goes in your methodology section, alongside the rationale for each input you used.
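
To see how the four inputs combine, here is a rough sketch for one common case: a two-sided, two-sample comparison of means. It uses the textbook normal approximation, n per group = 2(z for alpha + z for power)² / d², not the exact t-based calculation, so dedicated software can return a slightly larger number. The function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided, two-sample t-test.

    Normal approximation: n = 2 * (z_alpha + z_power)^2 / d^2.
    Exact t-based software may report a slightly larger n.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

medium = n_per_group(0.5)   # Cohen's d = 0.5
small = n_per_group(0.2)    # Cohen's d = 0.2
```

Under these assumptions a medium effect needs about 63 participants per group and a small effect about 393. Treat the sketch as a sanity check on the number your software reports, not as a replacement for it.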

Where most students go wrong with effect size

Effect size is the one input researchers tend to guess at or, worse, assume is ‘medium’ because Cohen’s d = 0.5 is the most cited number in methods textbooks. But if your actual effect is small (d = 0.2), a sample sized for a medium effect will leave you dramatically underpowered. Your study will miss real effects, and any significant results it does produce are more likely to overstate the true effect.
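
How dramatic is "dramatically underpowered"? The sketch below gives a rough answer for a two-group comparison, using a normal approximation and assuming 63 participants per group (roughly what a medium effect of d = 0.5 calls for at alpha = .05 and power = .80). The function name is hypothetical.

```python
from statistics import NormalDist

def approx_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation).

    Ignores the negligible chance of rejecting in the wrong direction.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    shift = effect_size * (n_per_group / 2) ** 0.5
    return NormalDist().cdf(shift - z_alpha)

planned = approx_power(0.5, 63)   # the effect the study was sized for
actual = approx_power(0.2, 63)    # the effect that really exists
```

With these inputs, power is about 0.80 for the assumed medium effect but only about 0.20 if the true effect is small. In other words, the study would miss a real effect roughly four times out of five.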

A much better approach: look at similar studies in your field. What effect sizes did they report? Use those as your benchmark. If the literature is sparse, a pilot study with 20-30 participants can give you a rough estimate to work from, though estimates from samples that small are noisy and should be treated with caution.

QUALITATIVE RESEARCH NOTE
Power analysis applies to quantitative studies. For qualitative work, sample size is guided by data saturation: you keep collecting until new participants stop offering new themes. Commonly cited guidelines put saturation at around 8-15 participants for phenomenological studies and 20-30 for grounded theory, but these are rules of thumb, not targets. Document how you made this judgment.

What if you have already collected your data?

Post-hoc power analysis (calculating power after the fact from your observed effect) is widely considered problematic and is increasingly rejected by journals. Observed power is a direct function of the p-value, so it adds no new information: if your results came back non-significant and you run a power analysis to show you were underpowered, reviewers will notice the circularity. It is better to acknowledge limitations honestly and recommend adequately powered replication in your future directions.

If you have not collected data yet, even a brief pilot run is worth the time. It gives you real-world estimates and protects your main study from being built on a guess.

THE ONE-LINE ANSWER FOR YOUR VIVA
Sample size was determined via a priori power analysis using G*Power (version X), with alpha = .05, power = .80, and an expected effect size of [X] based on [prior study]. This yielded a required n of [X].

PART 5:  DATA SOURCES

Primary vs Secondary Data: Which One Should You Use for Your Study?

The right data source depends on your research question, your timeline, and your honesty about what you can realistically access.

One of the first decisions you make in any research project — and one that shapes every decision after it — is where your data will come from. Will you collect it yourself, or will you use data that already exists? Both options are entirely legitimate. Neither is automatically superior. What matters is the fit between your data source and your research question.

What is primary data?

Primary data is data you collect yourself, directly from the source, specifically for your research purpose. Surveys, interviews, focus groups, observations, experiments — all of these produce primary data. You design the instruments, you recruit the participants, you control the quality of collection.

The advantage is precision. Your data answers exactly the questions you need it to answer, in the form you need it, from the population you are actually studying. The disadvantage is cost — in time, resources, and logistical complexity.

What is secondary data?

Secondary data was collected by someone else, for a purpose that may or may not overlap with yours. Government datasets, published census data, hospital records, corporate annual reports, prior studies’ datasets made available for reuse — these are all secondary sources.

The advantage is scale and depth. A government health survey might cover 50,000 households across ten years. You could never replicate that scope independently. The disadvantage is fit: the variables measured may not be quite what you need, the population sampled may differ from yours, and the collection methods are outside your control.

#  |  Data source  |  When to use
01  |  Primary data  |  Best when your topic is specific or novel, no existing data captures your variables, or you need direct access to a particular population.
02  |  Secondary data  |  Best when large samples are needed, longitudinal trends matter, or your question is well-served by existing administrative or survey data.
03  |  Mixed approach  |  Best when secondary data provides context or benchmarks, while primary data adds depth, nuance, or population-specific insight.

The questions to ask before deciding

Does existing data already answer my question? A surprising number of researchers collect primary data for questions that national datasets answer perfectly well. Check the available data first.

Can I realistically access my target population? If your study examines senior executives, hospital patients, or conflict-affected communities, primary data collection may be logistically impossible. Secondary data may be your only option, and that is a legitimate methodological choice.

What level of measurement do I need? If you need very specific psychological scales, behavioral observations, or experimental manipulations, no secondary dataset will contain them. You must go primary.

FOR THESIS WRITERS SPECIFICALLY
Whichever you choose, justify it. Your methodology chapter should explain why this data source fits your research design, its limitations, and how you have addressed those limitations. Saying ‘I used a survey because it is common in my field’ is not justification. Saying ‘primary data was necessary because no existing dataset captures perceived organizational justice at the team level in the Pakistani banking sector’ is.

A final thought

The best data is the data that most faithfully answers your research question while being practically accessible to you. Do not let ambition push you toward a primary collection effort you cannot complete properly. And do not let convenience push you toward secondary data that fundamentally does not fit your question. The choice is yours, but it should be a deliberate one.
