正则表达式以确定的顺序删除第一次出现的字母

Question

我正在尝试使用 python 和 tabula 包来抓取带有表格的 pdf。在某些情况下，提取的两列完全混淆了。我知道“类型”列应该只有这两个值：EE-Male 或 EE-Female。因此，我需要删除“类型”列中所有多余的字母，并按照它们出现的确切顺序将它们放在“名称”列的末尾。

Name                        Type
CHAK NO.162 NB PURANI AB    AEDEI-Male
EXCELLENT (ATTACH WITH GC   EEET-)M JaEleHLUM
PIND KHAN (TRATANI SAMAN    EDE) -Female
BASTI JAM SUMMAR PO RUKA    NEEP-UMRale
BASTI QAZIAN P/O KHANBEL    AEE-Female
GHAUS PUR MACHIAN PO RU     EKEA-FNe PmUaRle
NOOR MUHAMMAD CHEENR        AELE W-FAemLAale
PHATHI THARO KHELAN WAL     EI E-Female
WAH SAIDAN PO DAJAL RANJA   ENE P-MUaRle

因此我需要这两列：

Name                                  Type
CHAK NO.162 NB PURANI ABADI           EE-Male
EXCELLENT (ATTACH WITH GCET) JEHLUM   EE-Male
PIND KHAN (TRATANI SAMAND)            EE-Female
BASTI JAM SUMMAR PO RUKANPUR          EE-Male
BASTI QAZIAN P/O KHANBELA             EE-Female
GHAUS PUR MACHIAN PO RUKAN PUR        EE-Female
NOOR MUHAMMAD CHEENRAL WALA           EE-Female
PHATHI THARO KHELAN WALI              EE-Female
WAH SAIDAN PO DAJAL RANJAN PUR        EE-Male

有什么建议吗？谢谢！

Answer 1

您想在哪里/如何做？由于 tabula 是一个 Java 库，我假设你想使用 Java。所以这是一种方法，虽然它不是最优雅的：

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {

    public static String fixMixedText(String text) {
        String[] rows = text.split("\n");
        String[] newRows = new String[rows.length];

        String mString = "EE-Male";
        String fString = "EE-Female";

        String mRegex = "(.*)" + String.join("(.*)", mString.split("")) + "(.*)";
        String fRegex = "(.*)" + String.join("(.*)", fString.split("")) + "(.*)";


        for (int i = 0; i < rows.length; ++i) {
            String[] cols = rows[i].split("\s{2,}"); // 2 or more whitespaces
            assert(cols.length == 2);
            String[] newCols = new String[2];

            if (i == 0) {
                newRows[i] = String.join("\t", cols);
                // don't do any more processing than this for header row
                continue;
            }
            
            Matcher m = Pattern.compile(fRegex).matcher(cols[1]);

            boolean isFemaleMatch = m.find();

            if (!isFemaleMatch) {
                m = Pattern.compile(mRegex).matcher(cols[1]);
                if (!m.find()) {
                    // no matches of either type
                    continue;
                }
            }

            newCols[1] = isFemaleMatch ? fString : mString;
            StringBuilder sb = new StringBuilder();
            for (int matchIdx = 1; matchIdx <= m.groupCount(); ++matchIdx) {
                // start loop at 1 because group(0) returns entire match
                sb.append(m.group(matchIdx));
            }
            newCols[0] = cols[0] + sb.toString();
            newRows[i] = String.join("\t", newCols);
        }

        return String.join("\n", newRows);
    }

    public static void main(String... args) {

        String origText = "Name                        Type\n" +
                "CHAK NO.162 NB PURANI AB    AEDEI-Male\n" +
                "EXCELLENT (ATTACH WITH GC   EEET-)M JaEleHLUM\n" +
                "PIND KHAN (TRATANI SAMAN    EDE) -Female\n" +
                "BASTI JAM SUMMAR PO RUKA    NEEP-UMRale\n" +
                "BASTI QAZIAN P/O KHANBEL    AEE-Female\n" +
                "GHAUS PUR MACHIAN PO RU     EKEA-FNe PmUaRle\n" +
                "NOOR MUHAMMAD CHEENR        AELE W-FAemLAale\n" +
                "PHATHI THARO KHELAN WAL     EI E-Female\n" +
                "WAH SAIDAN PO DAJAL RANJA   ENE P-MUaRle";

        String fixedText = fixMixedText(origText);
        System.out.println(fixedText);

        /*
        Name    Type
        CHAK NO.162 NB PURANI ABADI EE-Male
        EXCELLENT (ATTACH WITH GCET) JEHLUM EE-Male
        PIND KHAN (TRATANI SAMAND)  EE-Female
        BASTI JAM SUMMAR PO RUKANPUR    EE-Male
        BASTI QAZIAN P/O KHANBELA   EE-Female
        GHAUS PUR MACHIAN PO RUKAN PUR  EE-Female
        NOOR MUHAMMAD CHEENRAL WALA EE-Female
        PHATHI THARO KHELAN WALI    EE-Female
        WAH SAIDAN PO DAJAL RANJAN PUR  EE-Male
        */
    }
}

Answer 2

这是一个对我有用的解决方案 python:

categories = ["EE-Male", "EE-Female"]
#Create a dictionary with categories as keys and a regular expression as values.
categories_regex = {}
for category in categories:
    categories_regex[category] = ".*" + ".*".join(list(category)) + ".*"

df['type'] = df.apply(lambda row : clean_categorical_var(row['type'], categories, categories_regex), axis = 1) 
df['name']  = df.apply(lambda row : clean_name_var(row, 'type', 'name', categories, 'type2'), axis = 1) 
df.drop(labels=["type"], axis=1, inplace = True) 
df.rename(columns={"type2":"type"}, inplace = True)

并且我使用了以下三个辅助功能：

def clean_categorical_var(categorical_cell, categories, categories_regex):
    '''
    Cleans a categorical variable cell such as the type variable. 
    Input:
        categorical_cell (str): content of the categorical cell tu clean
        categories (list): list with the values (str) supposed to find on the
                           categorical column (ex. EE-Male, EE-Female)
        categories_regex (dic): categories as keys and a regular expression for 
                                each category as values.
    Output:
        cleaned_category (str): cleaned category without the mixed letters.
    '''
    cleaned_category = np.nan
    for category in categories:
        regex = categories_regex[category]
        if re.match(regex, categorical_cell):
            cleaned_category = category

    return cleaned_category


def remove_letters(category, string_to_clean):
    '''
    Removes the letters on the category to recover the letters missing on the previous column. 
    Input:
        categories (list): list with the values (str) supposed to find on the 
                           categorical column (ex. EE-Male, EE-Female)
        string_to_clean (str): categorical column dirty from where to recover the missing letters
    Output:
        cleaned_name (str): cleaned name with the letters that were missing at the end.
    '''
    category = list(category) 
    letters_index_to_delete = []
    for n, letter in enumerate(list(string_to_clean)):
            if letter == category[0]:
                letters_index_to_delete.append(n)
                del category[0]
                if not category:
                    break
    return letters_index_to_delete


def clean_name_var(row, categorical_column, name_column, categories, 
                   categorical_column2='categorical_column_cleaned'):
    '''
    Cleans a name variable adding the letters that were missing at the end. 
    Input:
        row (df.row): The row from the df to be cleaned
        categorical_column (str): name of the column with the categories (ex. type)
        name_column (str): name of the column to be cleaned
        categories (list): list with the values (str) supposed to find on the 
                           categorical column (ex. EE-Male, EE-Female)
        categorical_column2 (str): name of the column with the categories cleaned (ex. type)
    Output:
        cleaned_name (str): cleaned name with the letters that were missing at the end.
    '''

    letters_index_to_delete = []
    col_name_end = list(row[categorical_column])
    if row[categorical_column] in categories:
        return row[name_column]
    for category in categories:
        if row[categorical_column2] == category:
            letters_index_to_delete = remove_letters(category, row[categorical_column])
            break 
    for n in sorted(letters_index_to_delete, reverse=True):
        del col_name_end[n]

    return row[name_column]+''.join(col_name_end)

正则表达式以确定的顺序删除第一次出现的字母

Regular expression to remove first occurrence of letters in a determined order

regex

pdf-scraping

tabula