Regular expression to remove the first occurrence of letters in a given order
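I am trying to scrape PDFs containing tables using Python and the tabula package. In some cases, two of the extracted columns get completely mixed up. I know the "Type" column should only contain one of two values: EE-Male or EE-Female. I therefore need to remove all the extra letters from the "Type" column and append them to the end of the "Name" column, in the exact order in which they appear.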
Name                         Type
CHAK NO.162 NB PURANI AB     AEDEI-Male
EXCELLENT (ATTACH WITH GC    EEET-)M JaEleHLUM
PIND KHAN (TRATANI SAMAN     EDE) -Female
BASTI JAM SUMMAR PO RUKA     NEEP-UMRale
BASTI QAZIAN P/O KHANBEL     AEE-Female
GHAUS PUR MACHIAN PO RU      EKEA-FNe PmUaRle
NOOR MUHAMMAD CHEENR         AELE W-FAemLAale
PHATHI THARO KHELAN WAL      EI E-Female
WAH SAIDAN PO DAJAL RANJA    ENE P-MUaRle
So the two columns I need are:
Name                                   Type
CHAK NO.162 NB PURANI ABADI            EE-Male
EXCELLENT (ATTACH WITH GCET) JEHLUM    EE-Male
PIND KHAN (TRATANI SAMAND)             EE-Female
BASTI JAM SUMMAR PO RUKANPUR           EE-Male
BASTI QAZIAN P/O KHANBELA              EE-Female
GHAUS PUR MACHIAN PO RUKAN PUR         EE-Female
NOOR MUHAMMAD CHEENRAL WALA            EE-Female
PHATHI THARO KHELAN WALI               EE-Female
WAH SAIDAN PO DAJAL RANJAN PUR         EE-Male
Any suggestions? Thanks!
Where and how do you want to do this? Since tabula is a Java library, I assume you want to use Java. Here is one approach, although it is not the most elegant:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    public static String fixMixedText(String text) {
        String[] rows = text.split("\n");
        String[] newRows = new String[rows.length];
        String mString = "EE-Male";
        String fString = "EE-Female";
        // Put a capture group around every expected character, e.g.
        // (.*)E(.*)E(.*)-(.*)M(.*)a(.*)l(.*)e(.*), so the groups collect the stray text.
        Pattern mPattern = Pattern.compile("(.*)" + String.join("(.*)", mString.split("")) + "(.*)");
        Pattern fPattern = Pattern.compile("(.*)" + String.join("(.*)", fString.split("")) + "(.*)");
        for (int i = 0; i < rows.length; ++i) {
            // Columns are separated by 2 or more whitespace characters.
            String[] cols = rows[i].split("\\s{2,}");
            assert cols.length == 2;
            String[] newCols = new String[2];
            if (i == 0) {
                newRows[i] = String.join("\t", cols);
                // don't do any more processing than this for the header row
                continue;
            }
            Matcher m = fPattern.matcher(cols[1]);
            boolean isFemaleMatch = m.find();
            if (!isFemaleMatch) {
                m = mPattern.matcher(cols[1]);
                if (!m.find()) {
                    // no match of either type: keep the row as extracted
                    newRows[i] = String.join("\t", cols);
                    continue;
                }
            }
            newCols[1] = isFemaleMatch ? fString : mString;
            StringBuilder sb = new StringBuilder();
            for (int matchIdx = 1; matchIdx <= m.groupCount(); ++matchIdx) {
                // start loop at 1 because group(0) returns the entire match
                sb.append(m.group(matchIdx));
            }
            // trim() drops a trailing space left over from the Type column
            newCols[0] = (cols[0] + sb).trim();
            newRows[i] = String.join("\t", newCols);
        }
        return String.join("\n", newRows);
    }

    public static void main(String... args) {
        // Two spaces separate the Name and Type columns in each row.
        String origText = "Name  Type\n" +
            "CHAK NO.162 NB PURANI AB  AEDEI-Male\n" +
            "EXCELLENT (ATTACH WITH GC  EEET-)M JaEleHLUM\n" +
            "PIND KHAN (TRATANI SAMAN  EDE) -Female\n" +
            "BASTI JAM SUMMAR PO RUKA  NEEP-UMRale\n" +
            "BASTI QAZIAN P/O KHANBEL  AEE-Female\n" +
            "GHAUS PUR MACHIAN PO RU  EKEA-FNe PmUaRle\n" +
            "NOOR MUHAMMAD CHEENR  AELE W-FAemLAale\n" +
            "PHATHI THARO KHELAN WAL  EI E-Female\n" +
            "WAH SAIDAN PO DAJAL RANJA  ENE P-MUaRle";
        String fixedText = fixMixedText(origText);
        System.out.println(fixedText);
        /* Output (columns are tab-separated):
        Name    Type
        CHAK NO.162 NB PURANI ABADI    EE-Male
        EXCELLENT (ATTACH WITH GCET) JEHLUM    EE-Male
        PIND KHAN (TRATANI SAMAND)    EE-Female
        BASTI JAM SUMMAR PO RUKANPUR    EE-Male
        BASTI QAZIAN P/O KHANBELA    EE-Female
        GHAUS PUR MACHIAN PO RUKAN PUR    EE-Female
        NOOR MUHAMMAD CHEENRAL WALA    EE-Female
        PHATHI THARO KHELAN WALI    EE-Female
        WAH SAIDAN PO DAJAL RANJAN PUR    EE-Male
        */
    }
}
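One caveat about the capture-group idea above: a greedy `(.*)` prefers later occurrences when a letter repeats, whereas the question asks for the first occurrence. On the sample rows both give the same result, but they can differ in general. Below is a minimal Python sketch of the same technique using lazy groups `(.*?)`. The names `CATEGORIES`, `PATTERNS`, `build_pattern` and `split_type` are made up for illustration and are not part of tabula or of the Java answer.

```python
import re

# Hypothetical constant: the only values the Type column may contain.
CATEGORIES = ["EE-Female", "EE-Male"]


def build_pattern(category):
    """Build a regex with one lazy capture group before each expected
    character and a greedy group for whatever follows the last one."""
    parts = ["(.*?)" + re.escape(ch) for ch in category]
    return re.compile("".join(parts) + "(.*)", re.DOTALL)


PATTERNS = {category: build_pattern(category) for category in CATEGORIES}


def split_type(dirty):
    """Return (leftover, category) for a mixed-up Type cell.

    `leftover` is the text that belongs at the end of the Name column.
    If no category matches, return (dirty, None) unchanged.
    """
    for category, pattern in PATTERNS.items():
        match = pattern.fullmatch(dirty)
        if match:
            # The capture groups hold everything except the category letters.
            return "".join(match.groups()), category
    return dirty, None


print(split_type("AEDEI-Male"))
print(split_type("EKEA-FNe PmUaRle"))
```

Because each group is lazy, the first `E` found is removed, then the next `E` after it, and so on, which is the "first occurrence, in order" behaviour the question describes.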
Here is a Python solution that worked for me:
categories = ["EE-Male", "EE-Female"]

# Create a dictionary with categories as keys and a regular expression as values.
# Each regex allows arbitrary characters between the letters of the category,
# e.g. ".*E.*E.*-.*M.*a.*l.*e.*" for "EE-Male".
categories_regex = {}
for category in categories:
    categories_regex[category] = ".*" + ".*".join(list(category)) + ".*"

# Store the cleaned category in a new column 'type2'; the dirty 'type' column
# is still needed below to recover the letters that belong to the name.
df['type2'] = df.apply(lambda row: clean_categorical_var(row['type'], categories, categories_regex), axis=1)
df['name'] = df.apply(lambda row: clean_name_var(row, 'type', 'name', categories, 'type2'), axis=1)
df.drop(labels=["type"], axis=1, inplace=True)
df.rename(columns={"type2": "type"}, inplace=True)
It uses the following three helper functions:
import re

import numpy as np


def clean_categorical_var(categorical_cell, categories, categories_regex):
    '''
    Cleans a categorical variable cell such as the type variable.
    Input:
        categorical_cell (str): content of the categorical cell to clean
        categories (list): list with the values (str) expected in the
            categorical column (ex. EE-Male, EE-Female)
        categories_regex (dict): categories as keys and a regular expression for
            each category as values.
    Output:
        cleaned_category (str): cleaned category without the mixed letters,
            or NaN if no category matches.
    '''
    cleaned_category = np.nan
    for category in categories:
        regex = categories_regex[category]
        if re.match(regex, categorical_cell):
            cleaned_category = category
    return cleaned_category


def remove_letters(category, string_to_clean):
    '''
    Finds the letters of the category inside the dirty string, taking the first
    occurrence of each letter in order.
    Input:
        category (str): the value expected in the categorical column
            (ex. EE-Male or EE-Female)
        string_to_clean (str): dirty categorical cell from which to recover the missing letters
    Output:
        letters_index_to_delete (list): indexes in string_to_clean of the letters
            that belong to the category.
    '''
    category = list(category)
    letters_index_to_delete = []
    for n, letter in enumerate(list(string_to_clean)):
        if letter == category[0]:
            letters_index_to_delete.append(n)
            del category[0]
            if not category:
                # every letter of the category has been found
                break
    return letters_index_to_delete


def clean_name_var(row, categorical_column, name_column, categories,
                   categorical_column2='categorical_column_cleaned'):
    '''
    Cleans a name variable by appending the letters that were missing at the end.
    Input:
        row (df.row): The row from the df to be cleaned
        categorical_column (str): name of the column with the dirty categories (ex. type)
        name_column (str): name of the column to be cleaned
        categories (list): list with the values (str) expected in the
            categorical column (ex. EE-Male, EE-Female)
        categorical_column2 (str): name of the column with the cleaned categories (ex. type2)
    Output:
        cleaned_name (str): cleaned name with the letters that were missing at the end.
    '''
    letters_index_to_delete = []
    col_name_end = list(row[categorical_column])
    # Nothing to recover if the cell already holds a clean category.
    if row[categorical_column] in categories:
        return row[name_column]
    for category in categories:
        if row[categorical_column2] == category:
            letters_index_to_delete = remove_letters(category, row[categorical_column])
            break
    # Delete from the end so the earlier indexes stay valid.
    for n in sorted(letters_index_to_delete, reverse=True):
        del col_name_end[n]
    return row[name_column] + ''.join(col_name_end)
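To check the first-occurrence scan without a DataFrame, the same logic can be sketched in plain Python with no regex at all. This is only an illustration under the assumption that the two columns have already been split. `CATEGORIES`, `strip_category` and `fix_row` are hypothetical names, not part of the answer above.

```python
# Hypothetical constant: the only values the Type column may contain.
CATEGORIES = ["EE-Female", "EE-Male"]


def strip_category(dirty, category):
    """Remove the characters of `category` from `dirty`, taking the first
    occurrence of each character in order.

    Return the leftover text, or None if `category` is not a subsequence
    of `dirty`.
    """
    leftover = []
    pos = 0  # index of the next category character still to be found
    for ch in dirty:
        if pos < len(category) and ch == category[pos]:
            pos += 1  # this character belongs to the category
        else:
            leftover.append(ch)  # this character belongs to the name
    if pos < len(category):
        return None
    return "".join(leftover)


def fix_row(name, dirty, categories=CATEGORIES):
    """Return (fixed_name, category); unrecognised rows come back unchanged."""
    for category in categories:
        leftover = strip_category(dirty, category)
        if leftover is not None:
            # strip() drops a stray trailing space taken from the Type cell
            return (name + leftover).strip(), category
    return name, dirty


rows = [
    ("CHAK NO.162 NB PURANI AB", "AEDEI-Male"),
    ("GHAUS PUR MACHIAN PO RU", "EKEA-FNe PmUaRle"),
    ("PHATHI THARO KHELAN WAL", "EI E-Female"),
]
for name, dirty in rows:
    print(fix_row(name, dirty))
```

A single left-to-right pass is enough here: if the category can be found as a subsequence at all, taking each letter at its earliest possible position always succeeds.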