🧵 First Steps with Text Strings in Python: [Part 3 of 3]: Transforming Text with Essential Methods

"30 minutes of interviewing... and 4 hours cleaning up the transcript?"

As a qualitative researcher, I used to spend days on repetitive tasks:

  • Converting inconsistent responses ("YES", "Yes", "yes") to a standard format.

  • Cleaning transcripts plagued with extra spaces and line breaks.

  • Extracting specific codes (like "ID_2023_P15") from hundreds of files.

  • Standardizing names ("garcía, ana", "ANA GARCÍA", "Ana Garcia").

Now I solve in minutes with Python what used to take me hours in Excel. In this practical guide, I'll show you how to automate common qualitative research tasks.

What will you learn?

  • The essential Python methods that will solve 80% of your text problems.

  • Proven patterns to automate qualitative data cleaning.

  • Techniques to unify responses from multiple surveyors.

  • Reusable solutions for processing large volumes of textual data.

Python Methods: Your Toolbox for Qualitative Analysis

What is a method in Python? If you're new to programming, think of methods as specialized tools. Just as you would use different tools to clean, cut, or join materials, in Python we use different methods to transform text.

As a qualitative researcher, I face three common challenges when processing text:

  1. Format inconsistency: Different surveyors use different formats (uppercase/lowercase, spaces).

  2. "Dirty" data: Extra spaces, special characters, unnecessary line breaks.

  3. Need for standardization: Converting different versions of the same text to a single, consistent format.

Let's look at a practical example of how Python can help with these challenges:

response = "   pedro páramo   "

# Here, .strip() is a method that removes spaces at the beginning and end

response = response.strip()     # result: "pedro páramo"
print(response)

# And .title() is another method that capitalizes each word

response = response.title()     # result: "Pedro Páramo"
print(response)

Note: Notice how each method (.strip(), .title()) performs a specific task. This modularity allows us to transform text step by step in a controlled manner.

You'll recognize a method by the dot (.) after your variable name. It's like telling Python: "take this text and apply this specific transformation to it."

Now, let's explore the most useful tools for our research, organized by common tasks.

Basic Qualitative Data Cleaning

1. Standardizing Upper and Lower Case

The problem: In qualitative research, responses come in various formats. For example, for a simple "yes" response, we might find: "YES", "Yes", "yes", "YÉS", etc. This inconsistency complicates frequency and pattern analysis.

Python offers three main methods for standardizing the use of upper and lower case:

  • lower(): Converts all text to lowercase - ideal for comparing responses.

  • title(): Converts the first letter of each word to uppercase - perfect for proper names.

  • upper(): Converts all text to uppercase - useful for codes or identifiers.

Let's see these methods in action.

Scenario: You're processing survey responses where different surveyors have recorded responses inconsistently:

# Standardizing responses to lowercase for comparison
response1 = "YES"
response2 = "Yes"
response3 = "yes"

new_response1 = response1.lower()  # result: "yes"
new_response2 = response2.lower()  # result: "yes"
new_response3 = response3.lower()  # result: "yes"

print(f"""Response1 was standardized from '{response1}' to '{new_response1}',
Response2 was standardized from '{response2}' to '{new_response2}',
And response3 was standardized from '{response3}' to '{new_response3}',
""")

# For proper names or program titles

program_name = "youth and employment program"
formatted_name = program_name.title()
print(f"The program name changed from '{program_name}' to '{formatted_name}'")  # Result: "Youth And Employment Program"


# For codes or identifiers

project_code = "py123"
standard_code = project_code.upper()
print(f"The code changed from '{project_code}' to '{standard_code}'")  # Result: "PY123"

Pro Tip

Recommended workflow:

  1. Before doing any frequency analysis or comparisons, use .lower()

  2. Always keep a copy of the original unmodified data.

  3. Apply .title() only at the end, when preparing reports.
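Here is a minimal sketch of that workflow. It assumes a small, made-up list of answers (raw_responses is invented for this example) and uses Counter from Python's standard library to do the counting:

```python
from collections import Counter

# Hypothetical raw answers recorded by different surveyors
raw_responses = ["YES", "Yes", "no", "yes", "NO", "Yes"]

# Keep the originals untouched and build a standardized copy
standardized = [response.lower() for response in raw_responses]

# Frequency analysis on the standardized copy
frequencies = Counter(standardized)
print(frequencies)

# Apply .title() only at the end, when presenting the results
for answer, total in frequencies.items():
    print(f"{answer.title()}: {total}")
```

Because raw_responses is never overwritten, you can always go back and check what each surveyor actually wrote.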

2. Removing Unnecessary Spaces

The problem: Transcripts and field data often have extra spaces, tabs, and line breaks that complicate analysis. For example, two identical responses might appear different due to extra spaces:

  • " yes " vs "yes"

  • "San Juan de Lurigancho " vs "San Juan de Lurigancho"

Solution: Python offers specific methods for cleaning these problematic spaces:

# Example of responses with problematic spaces
original_response = "    Yes, I agree   "
location = "   San Juan    de Lurigancho  "
comment = "Very\t\tgood\n\n"  # Tabs and line breaks

# Cleaning the data

clean_response = original_response.strip()
print(f"Original: '{original_response}'")
print(f"Clean   : '{clean_response}'")

# For more specific cases:

clean_location = " ".join(location.split())  # Removes multiple spaces
print(f"Original: '{location}'")
print(f"Clean   : '{clean_location}'")

# Cleaning line breaks and tabs

clean_comment = " ".join(comment.split())  # split() also treats tabs and line breaks as whitespace
print(f"Original: '{comment}'")
print(f"Clean   : '{clean_comment}'")

Pro Tip

Recommended cleaning strategy:

  1. First apply .strip() to remove spaces at the beginning and end

  2. Then use .split() followed by .join() to normalize spaces between words

  3. Finally, remove special characters with .replace()

Important: Always document the transformations made to maintain data traceability.
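One way to make this cleaning reusable is to wrap it in a small function. The sketch below is only an illustration: the name normalize_whitespace and the sample values are my own, not part of any library. It relies on the fact that split() with no arguments cuts on any run of spaces, tabs, or line breaks and ignores whitespace at both ends.

```python
def normalize_whitespace(raw_text):
    """Return raw_text without leading, trailing, or repeated whitespace."""
    # split() with no arguments handles spaces, tabs, and line breaks;
    # join() rebuilds the text with a single space between words
    return " ".join(raw_text.split())


# Hypothetical field values with problematic whitespace
raw_values = ["   San Juan    de Lurigancho  ", "Very\t\tgood\n\n"]

for raw in raw_values:
    clean = normalize_whitespace(raw)
    # !r shows tabs and line breaks explicitly, which helps document the change
    print(f"{raw!r} -> {clean!r}")
```

Printing the before and after values side by side is a simple way to keep a record of what was changed.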

3. Processing Identifiers and Codes

The problem: In research projects, we work with standardized identifiers that need processing. For example:

  • Participant codes (PART_001_2023)

  • File names (INT_MariaGarcia_20231015.txt)

  • Form URLs

  • Pre/post test codes

Solution: Python offers specific methods for extracting and processing these structures:

# Cleaning participant codes
participant_code = "PART_001_2023"
clean_name = participant_code.removeprefix("PART_")  # Result: "001_2023"
print(f"""
Original: {participant_code}
Extracted: {clean_name}
""")

# Extracting transcription file names

transcript_file = "INT_MariaGarcia_20231015.txt"
interview_name = transcript_file.removesuffix(".txt")  # Result: "INT_MariaGarcia_20231015"
print(f"""
Original: {transcript_file}
Extracted: {interview_name}
""")

# Example with form URLs

form_url = "https://forms.google.com/project_responses"
form_name = form_url.removeprefix("https://forms.google.com/")
print(f"""
Original: {form_url}
Extracted: {form_name}
""")

# Processing multiple prefixes/suffixes

complex_code = "PRE_TEST_001_POST"
clean_code = (complex_code
              .removeprefix("PRE_")
              .removesuffix("_POST"))
print(f"""
Original: {complex_code}
Clean: {clean_code}
""")

Pro Tip

Processing strategies:

1. For predictable codes:

  • Use removeprefix() and removesuffix() (available in Python 3.9 and later) to remove known parts

  • These are safer than replace() for this purpose

2. For complex structures:

  • Break down the process into small, documented steps

  • Keep a record of transformations made

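Important: removeprefix() and removesuffix() return the text unchanged when the prefix or suffix is not there, so they never raise an error. If you need to know whether a code follows the expected format, check it first with startswith() or endswith().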
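As a sketch of how to check for a prefix before removing it, the example below separates codes that follow the expected format from those that don't. The list of codes is hypothetical, and removeprefix() needs Python 3.9 or later:

```python
# Hypothetical participant codes; only the first follows the expected format
codes = ["PART_001_2023", "part_002_2023", "003_2023"]

extracted = []  # codes that had the prefix, with the prefix removed
flagged = []    # codes that need manual review

for code in codes:
    if code.startswith("PART_"):
        extracted.append(code.removeprefix("PART_"))
    else:
        # removeprefix() would simply return the code unchanged here
        flagged.append(code)

print(f"Extracted: {extracted}")
print(f"To review: {flagged}")
```

Note that startswith() is case-sensitive, which is why "part_002_2023" ends up in the review list.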

4. Separating and Modifying Responses

The problem: In qualitative research, we frequently need to:

  • Separate multiple responses into individual categories

  • Standardize transcript formats

  • Process concatenated demographic data

  • Unify location formats

Solution: Our main methods are split() and replace():

# 1. Separating multiple responses

multiple_response = "Education, Health, Housing"
topics = multiple_response.split(",")
print(f"""
Original response: {multiple_response}
Separated topics: {topics}
""")

# 2. Processing interview transcripts

transcript = "I: What do you think about the program?\nP: I found it very useful"
clean_transcript = transcript.replace("I:", "Interviewer:").replace("P:", "Participant:")
print(f"""
Original transcript:
{transcript}

Formatted transcript:
{clean_transcript}
""")

# 3. Standardizing region codes

location = "Lima-Peru"
standard_location = location.replace("-", ", ")
print(f"""
Original location: {location}
Standardized location: {standard_location}
""")

# 4. Separating demographic data

demographic_data = "Maria|25|F|Lima"
name, age, gender, city = demographic_data.split("|")
print(f"""Name: {name}
Age: {age}
Gender: {gender}
City: {city}
""")

Pro Tip

Processing strategies:

1. For data separation:

  • Identify the correct delimiter: comma, pipe, or tab.

  • Use split() to create manageable lists.

  • Consider strip() on each resulting element.

2. For replacements:

  • Perform replacements consistently.

  • Document each transformation.

  • Verify there are no undesired side effects.

Important:

  • split() always returns a list of elements

  • replace() generates a new string

  • Before transforming, save the original data
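To illustrate the advice about using strip() on each element, here is a short sketch. The response text is made up, with uneven spacing around the commas:

```python
# Hypothetical multiple-choice answer with uneven spacing
multiple_response = "Education, Health , Housing"

# split(",") keeps the spaces that surround each topic
raw_topics = multiple_response.split(",")

# strip() on each element removes them
topics = [topic.strip() for topic in raw_topics]

print(raw_topics)          # ['Education', ' Health ', ' Housing']
print(topics)              # ['Education', 'Health', 'Housing']
print(multiple_response)   # the original string is unchanged
```

Without the strip() step, " Health " and "Health" would be counted as two different topics.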

Other Useful String Methods

So far, we've focused on the most frequent methods for qualitative data processing. However, Python offers us a broader set of tools that can enrich our analysis.

Why Do We Need More Methods?

In social research, we frequently encounter specific needs such as:

  • Searching for patterns in extensive transcripts (using find(), count())

  • Validating formats in demographic data (using isalpha(), isdigit())

  • Aligning text in reports (using center(), ljust(), rjust())

  • International comparisons (using casefold())

The following table presents additional methods to solve these challenges:

Method | Description | Example | Result
capitalize() | Converts the first letter to uppercase and the rest to lowercase | "hello world".capitalize() | "Hello world"
casefold() | Similar to lower(), but more aggressive for international comparisons | "Straße".casefold() | "strasse"
center() | Centers the text in a given space | "python".center(10, "-") | "--python--"
count() | Counts occurrences of a substring | "banana".count("a") | 3
find() | Finds the first position of a substring (returns -1 if not found) | "Hello world".find("world") | 6
format() | Formats a string with specific values | "{} is {} years old".format("Ana", 25) | "Ana is 25 years old"
index() | Similar to find(), but raises a ValueError if not found | "Hello world".index("world") | 6
join() | Joins list elements with the specified string | "-".join(["a", "b", "c"]) | "a-b-c"
ljust() | Left-aligns the text | "hello".ljust(10, "*") | "hello*****"
rjust() | Right-aligns the text | "hello".rjust(10, "*") | "*****hello"
swapcase() | Inverts upper and lower case | "Hello World".swapcase() | "hELLO wORLD"
translate() | Replaces characters according to a table | "Hello".translate(str.maketrans("o", "0")) | "Hell0"
isalpha() | Verifies if all characters are letters | "Python3".isalpha() | False
isdigit() | Verifies if all characters are digits | "12345".isdigit() | True
startswith() | Verifies if string starts with a substring | "Python".startswith("Py") | True
endswith() | Verifies if string ends with a substring | "file.txt".endswith(".txt") | True

When to Use Each Method?

For searching text:

  • count(): Keyword frequency.

  • find()/index(): Locating specific segments.

  • startswith()/endswith(): Code validation.
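A small sketch of these search methods on a made-up transcript line (the text and the keyword are invented for the example):

```python
# Hypothetical line from an interview transcript
transcript = "P: The program helped my family. The program also gave me new skills."
keyword = "program"

# Lowercase first so the search does not depend on capitalization
searchable = transcript.lower()

mentions = searchable.count(keyword)          # how many times the keyword appears
first_position = searchable.find(keyword)     # index of the first occurrence
missing = searchable.find("budget")           # find() returns -1 when nothing is found
has_participant_label = transcript.startswith("P:")

print(f"Mentions of '{keyword}': {mentions}")
print(f"First mention at position: {first_position}")
print(f"Position of 'budget': {missing}")
print(f"Participant line: {has_participant_label}")
```

Unlike find(), index() would raise an error for the missing word, so find() is usually the gentler choice when you are not sure the text is there.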

For data validation:

  • isalpha(): Pure text fields.

  • isdigit(): Numeric fields.

  • casefold(): Case-insensitive comparisons.
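The following sketch shows how these checks behave. The field values are hypothetical; note that isalpha() returns False as soon as the text contains a space.

```python
# Hypothetical values from a demographic form
age_field = "25"
name_field = "Maria"

age_is_valid = age_field.isdigit()                 # True: only digits
name_is_valid = name_field.isalpha()               # True: only letters
full_name_is_alpha = "Maria Garcia".isalpha()      # False: the space is not a letter

# casefold() matches spellings that lower() treats as different
same_with_lower = "Straße".lower() == "STRASSE".lower()
same_with_casefold = "Straße".casefold() == "STRASSE".casefold()

print(age_is_valid, name_is_valid, full_name_is_alpha)
print(f"lower(): {same_with_lower} | casefold(): {same_with_casefold}")
```

For full names, one option is to validate each word separately after a split().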

For reports and presentation:

  • center(): Centered titles.

  • ljust()/rjust(): Column alignment.

  • format(): Report templates.
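As an illustration of the alignment methods, this sketch builds a tiny text table. The topics and counts are invented, and the column widths (15 and 5) are arbitrary choices:

```python
# Hypothetical topic counts for a short report
rows = [("Education", 12), ("Health", 7), ("Housing", 3)]

# Centered title, padded with "=" to a total width of 20 characters
header = "Topics".center(20, "=")

# Topic on the left (filled with dots), count on the right
report_lines = [topic.ljust(15, ".") + str(total).rjust(5) for topic, total in rows]

print(header)
for line in report_lines:
    print(line)
```

The counts are converted with str() first because ljust() and rjust() are string methods.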

You don't need to memorize all these methods. The key is knowing they exist and consulting them when a specific need arises in your research.

Want to Learn More?

Additional Resources:

  1. Use help(str) in your Python interpreter.

  2. Visit Python's official documentation.

  3. Check online resources like W3Schools.

The best way to learn these methods is by practicing with real examples and experimenting with different combinations.

From Learning to Practice

As social researchers, our work with text typically follows three stages:

  1. Preparation: Cleaning and standardizing raw data.

  2. Processing: Transforming and structuring information.

  3. Analysis: Generating reports and results.

The following exercises simulate this real workflow, allowing you to apply the methods we just learned in your research.

Exercise 1: Cleaning Transcripts

Context: You've conducted interviews about the impact of a social program. Before analysis, you need to resolve formatting inconsistencies in the transcripts, made by different assistants.

transcript = """
I: "How    would you    describe the    program    ?"
P: "VERY USEFUL,     actually"
"""

Your task is to:

  1. Remove multiple spaces between words

  2. Convert "I:" and "P:" to "Interviewer:" and "Participant:" respectively

  3. Convert participant responses from uppercase to title case

  4. The final result should look like this:

Interviewer: "How would you describe the program?"
Participant: "Very Useful, Actually"

Hints:

  • Use split() and join() to handle multiple spaces

  • Use replace() to change I: and P: labels

  • Use title() to format the participant's response

See solution:

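transcript = """
I: "How    would you    describe the    program    ?"
P: "VERY USEFUL,     actually"
"""

# 1. Remove the line breaks at the beginning and end, then separate the two lines
lines = transcript.strip().split("\n")

# 2. Collapse multiple spaces in each line with split() and join(),
#    and remove the leftover space before the question mark
interviewer_line = " ".join(lines[0].split()).replace(" ?", "?")
participant_line = " ".join(lines[1].split())

# 3. Replace the speaker labels
interviewer_line = interviewer_line.replace("I:", "Interviewer:")
participant_line = participant_line.replace("P:", "Participant:")

# 4. Convert the participant's response to title case
participant_line = participant_line.title()

print("Original transcript:")
print(transcript)
print("\nProcessed transcript:")
print(interviewer_line)
print(participant_line)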

Exercise 2: Basic Record Processing

Context: You're coordinating a training program and need to process attendance records to generate executive reports. The data comes in raw format and needs to be structured.

record = "  JANE SMITH,25,F,YES,2024-01-15  "

Your task is to process this text to create a summary in the following format:

Name: Jane Smith | Age: 25 | Attendance: Present | Email: jane.smith@company.com

To achieve this, you'll need to:

  1. Remove extra spaces at the beginning and end.

  2. Split the fields using the comma.

  3. Format the name in title case.

  4. Convert 'YES' to 'Present'.

  5. Create the email using the name in lowercase.

Hints:

  • Use strip() to clean spaces.

  • split(',') to separate fields.

  • title() to format the name.

  • replace() to change YES/NO.

  • lower() and split() to create the email.

See solution:

# Event record:
record = "  JANE SMITH,25,F,YES,2024-01-15  "

# 1. Clean and split fields
record = record.strip()
fields = record.split(',')

# 2. Process name and create email
name = fields[0].title()
name_parts = name.lower().split()
email = f"{name_parts[0]}.{name_parts[1]}@company.com"

# 3. Format attendance
attendance = fields[3].replace('YES', 'Present')

# 4. Create summary using f-string
summary = f"Name: {name} | Age: {fields[1]} | Attendance: {attendance} | Email: {email}"
print(summary)

Exercise 3: Advanced CSV Data Processing

Context: Your team has completed a training program evaluation. You need to transform raw form data into an executive report for stakeholders.

form_record = "   123,IMPACT-PROGRAM,john.doe@org.edu,30,SATISFACTORY!!!   "

This is a record from a form exported to CSV where:

  • Column 1: Report number.

  • Column 2: Program name.

  • Column 3: Facilitator's email.

  • Column 4: Number of participants.

  • Column 5: Training outcome.

Your goal is to clean this record to generate an executive report. The code should:

  1. Remove unnecessary spaces.

  2. Split fields using the comma.

  3. Format each field appropriately:

    1. Number: keep only digits.

    2. Program: convert "IMPACT-PROGRAM" to "Impact Program".

    3. Email: convert "john.doe@org.edu" to "John Doe".

    4. Participants: keep only the number.

    5. Outcome: remove exclamation marks and capitalize.

The result should look like this:

Report #123
Program: Impact Program
Facilitator: John Doe
Participants: 30
Outcome: Satisfactory

Try it before looking at the solution!

See solution:

# 1. Store the original data in a variable. This text has extra spaces
# at the beginning and end, and its fields are comma-separated
form_record = "   123,IMPACT-PROGRAM,john.doe@org.edu,30,SATISFACTORY!!!   "

# 2. Remove unnecessary spaces at the beginning and end using .strip()
clean_record = form_record.strip()

# 3. Split each part using .split() and store each in a different variable,
# in the order they appear
number, program, email, participants, outcome = clean_record.split(",")

# 4. Improve the program name format, replacing the hyphen with a space
# and capitalizing the first letter of each word using .title()
program = program.replace("-", " ").title()

# 5. Extract and format the facilitator's name. Take only the part before @
# using slicing and .find(). Then replace the dot with a space and
# capitalize the first letter of each word
facilitator_name = email[:email.find("@")].replace(".", " ").title()

# 6. Process the outcome. Remove exclamation marks and convert only the
# first letter to uppercase using .capitalize()
outcome = outcome.replace("!", "").capitalize()

# 7. Create the final report using an f-string
report = f"""
Report #{number}
Program: {program}
Facilitator: {facilitator_name}
Participants: {participants}
Outcome: {outcome}
"""

# 8. Show the result
print(report)

This solution primarily uses:

  • Basic string methods like .strip() for cleaning spaces

  • .replace() to change characters, and .title() and .capitalize() for formatting uppercase

  • The .split() method to separate text into individual fields

  • Slicing techniques and .find() to process the email

  • f-strings to create the final report in a readable and structured way

The key is to process the data step by step, transforming each field as needed.
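Once the steps work for one record, they can be reused for many. The sketch below wraps the same transformations in a function and applies it to a list. The function name format_record, the one-line output layout, and the second record are my own assumptions for illustration:

```python
def format_record(raw_record):
    """Turn one comma-separated form record into a one-line summary."""
    # Clean the ends and unpack the five expected fields
    number, program, email, participants, outcome = raw_record.strip().split(",")

    # Same field-by-field transformations as in the exercise
    program = program.replace("-", " ").title()
    facilitator = email[:email.find("@")].replace(".", " ").title()
    outcome = outcome.replace("!", "").capitalize()

    return (f"Report #{number} | Program: {program} | Facilitator: {facilitator} | "
            f"Participants: {participants} | Outcome: {outcome}")


# Hypothetical batch of records exported from the form
records = [
    "   123,IMPACT-PROGRAM,john.doe@org.edu,30,SATISFACTORY!!!   ",
    "124,YOUTH-SKILLS,ana.garcia@org.edu,18,good!  ",
]

for record in records:
    print(format_record(record))
```

This version assumes every record has exactly five fields; a record with a different number of commas would stop the program with an error, so it is worth checking the data first.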

Practical Tips from My Learning Experience

As a social science researcher who learned to program, I've discovered that much of our work involves processing and transforming text. These Python methods will help you automate repetitive tasks and maintain data consistency. Whether analyzing interviews, cleaning survey data, or preparing reports, these methods have become invaluable tools in my daily work.

Strings are Immutable

In Python, strings are immutable, meaning they cannot be modified once created. String methods don't modify the original; they create a new one.

Let's see an example:

# Example of immutability
name = "john smith"
name.title()        # This does NOT modify 'name'
print(name)         # Prints: "john smith"

# To save changes, you need to reassign the result
name = name.title()
print(name)         # Now it prints: "John Smith"

Method Chaining

Each method returns a new string. We can chain several in a single line:

text = "   HELLO WORLD   "
result = text.strip().lower().title()
print(result)      # Prints: "Hello World"

Other Important Tips

  • Experiment by combining different methods to achieve the desired result.

  • These methods are especially useful when working with user input.

  • If you need to apply multiple methods, consider whether it's more readable to do it in one line or separate steps.
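To compare both styles, here is the same cleaning written as a single chained line and as separate steps. The sample name is made up, and both versions should produce the same result:

```python
# Hypothetical name typed with extra spaces and mixed case
raw_name = "   GARCÍA   ana  "

# Option 1: everything in one line
chained = " ".join(raw_name.split()).title()

# Option 2: one step per line, easier to inspect and document
trimmed = raw_name.strip()
single_spaced = " ".join(trimmed.split())
stepwise = single_spaced.title()

print(chained)
print(stepwise)
print(chained == stepwise)
```

The one-line version is compact, while the step-by-step version lets you print each intermediate value when something looks wrong.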

As learners on this programming journey, these tools open up a world of possibilities for working with text in Python. The best way to master them is by practicing with examples related to our interests and needs. Let's keep learning together!

 