Optimizing VBA / Excel Macro Code (Finding Duplicates in Large Sheet)

RJGordon

I haven't really ever coded in VBA or anything of the sort except short little things in VB done years ago. This is my attempt at writing some code that will search through a excel sheet database of client accounts and search for possible duplicate accounts. Sadly, on the machine that I need to run this on, it can only handle about 3,500 entries without crashing Excel. I contribute this to both my code being horribly unoptimized as-well as the machine being slow.

What can be done to optimize the following code, and what best practices in VBA should I use in the future?

'Essentially, this loops through each row in the sheet
'For each row, it loops through every row after it, searching for duplicates of itself (skipping over a rows that have previously been marked as duplicates)
'Duplicates are defined by entries that meet a 'threshhold' of similarity
'The threshhold is defined as the number '5', first and last names are each two points, address and email address are one point
'That means that in order for an entry to meet the thresshold, the first and last name must be the same, and it must also have either the same address or email
'When duplicates are found, the duplicate column is marked as 'Yes' for that row, and the first occurence column is marked with a number defining the row number where the account first appeared

Sub Main():
    Dim lNameCol, fNameCol, addressCol, emailCol, duplicateCol, fOccurenceCol As String

    'Defines the column letters for the various data fields
    lNameCol = "A"
    fNameCol = "B"
    addressCol = "C"
    emailCol = "D" 
    duplicateCol = "E"   'The column where a entry/row will be marked as being a duplicate
    fOccurenceCol = "F"  'The column that contains the row number where a duplicate accounts first occurence was found

    Call Duplicates(lNameCol, fNameCol, addressCol, emailCol, duplicateCol, fOccurenceCol)
End Sub

'Gets number of rows in currently active sheet
Function RowCount():
    Application.ActiveSheet.UsedRange
    RowCount = Worksheets("Sheet1").UsedRange.Rows.Count
End Function

'Finds and labels duplicates
Sub Duplicates(ByVal lNameCol As String, ByVal fNameCol As String, ByVal addressCol As String, ByVal emailCol As String, ByVal duplicateCol As String, ByVal fOccurenceCol As String)
    Dim lRowCount As Integer
    lRowCount = RowCount()

    'Loops through each row in the sheet
    For i = 1 To lRowCount

        Dim duplicate, lastName, firstName, email, address As String

        'Sets these variables' values corresponding cell value in row 'i'
        'UCase capitilizes things to make entries case-insensitive
        duplicate = UCase(Range(duplicateCol & i).Value)
        lastName = UCase(Range(lNameCol & i).Value)
        firstName = UCase(Range(fNameCol & i).Value)
        email = UCase(Range(emailCol & i).Value)
        address = UCase(Range(addressCol & i).Value)

        'Checks to make sure row has not already been marked a duplicate, if it hasn't it continues
        If (StrComp(duplicate = "YES", vbTextCompare) = 1) Then

            'Loops through every row after the current row (row 'i')
            For n = (i + 1) To lRowCount

                'duplicateThreshold is an integer that defines the threshhold of similarity that rows need to have in order to be labeled a duplicate
                Dim duplicateThreshhold As Integer
                Dim lastName2, firstName2, email2, address2 As String

                duplicateThreshhold = 0

                'These are the entry variables for account entry at row 'n' being compared to the account entry at row 'i'
                lastName2 = UCase(Range(lNameCol & n).Value)
                firstName2 = UCase(Range(fNameCol & n).Value)
                email2 = UCase(Range(emailCol & n).Value)
                address2 = UCase(Range(addressCol & n).Value)

                'Adds 2 points to threshhold if first name is the same
                If lastName = lastName2 Then
                    duplicateThreshhold = duplicateThreshhold + 2

                End If

                'Adds 2 points to threshold if last name is the same
                If firstName = firstName2 Then
                    duplicateThreshhold = duplicateThreshhold + 2
                End If

                'The remaining two fields give 1 point each to the thresshold
                'As long as the sum of the points given by first and last name is always greater than half of the threshhold, first and last name will always be required

                If email = email2 Or address = address2 Then
                    duplicateThreshhold = duplicateThreshhold + 1
                End If

                If duplicateThreshhold > 4 Then
                   'Labels duplicate entries as duplicates
                    Range(duplicateCol & i).Value = "Yes"
                    Range(duplicateCol & n).Value = "Yes" 

                   'Labels duplicate entries with the first occurence of that entry
                    Range(fOccurenceCol & i).Value = i 'Labels first occurence account's row number
                    Range(fOccurenceCol & n).Value = i

                End If

            Next
        End If
    Next


End Sub
PeterT

Okay, so this is one of those problems that stuck in the back of my head, so I had to solve it (thanks a lot @RJGordon!). I ended up solving it two different ways - the first with nested loops and the second with hashed dictionaries. The second is a cleaner and faster algorithm, but I'll present both for thoroughness sake.

Nested Loops

As @JohnColeman pointed out, this method makes sense logically but scales terribly. It was easy enough to provide a list of all the duplicate rows for each record and has the advantage of tagging the first rows in the dataset. (The second solution below does not tag the initial record with the duplicates below, but you can solve for that as well if needed.)

Option Explicit

Sub test()
    MarkDuplicates ActiveSheet, 1, 2, 3, 4, 5, 6
End Sub

Sub MarkDuplicates(sh As Worksheet, lNameCol As Long, _
                   fNameCol As Long, addressCol As Long, _
                   emailCol As Long, duplicateCol As Long, _
                   fOccuranceCol As Long)
    Dim lastRow As Long
    Dim lastCol As Long
    Dim acctRange As Range
    Dim acctData As Variant
    Dim checkRow As Long
    Dim otherRow As Long
    Dim dupScore As Integer
    Dim dupList As String

    '--- determine the range of data and copy to a memory-based array
    lastRow = sh.Cells(sh.Rows.Count, "A").End(xlUp).Row
    lastCol = sh.Cells(1, sh.Columns.Count).End(xlToLeft).Column
    Set acctRange = sh.Range("A1").Resize(lastRow, lastCol)
    acctData = acctRange

    '--- nested loop to check each row against every other row
    For checkRow = 2 To lastRow
        dupList = ""
        For otherRow = 2 To lastRow
            dupScore = 0
            If otherRow <> checkRow Then
                If acctData(checkRow, lNameCol) = acctData(otherRow, lNameCol) Then
                    dupScore = dupScore + 2
                End If
                If acctData(checkRow, fNameCol) = acctData(otherRow, fNameCol) Then
                    dupScore = dupScore + 2
                End If
                If acctData(checkRow, addressCol) = acctData(otherRow, addressCol) Then
                    dupScore = dupScore + 1
                End If
                If acctData(checkRow, emailCol) = acctData(otherRow, emailCol) Then
                    dupScore = dupScore + 1
                End If
                If dupScore > 4 Then
                    dupList = dupList & otherRow & ","
                End If
            End If
        Next otherRow
        If Len(dupList) > 0 Then
            dupList = Left(dupList, Len(dupList) - 1)
            acctData(checkRow, duplicateCol) = "Yes"
            acctData(checkRow, fOccuranceCol) = dupList
        Else
            acctData(checkRow, duplicateCol) = ""
            acctData(checkRow, fOccuranceCol) = ""
        End If
    Next checkRow

    '--- copy the array back to the worksheet
    acctRange = acctData

    Set sh = Nothing
End Sub

Using Dictionaries

And I do mean dictionaries (plural). Because your duplicate score threshold can be reached with three different combinations of fields, your dictionary hash must test each combination. The dictionary key (hash) I've chosen is a concatenated string of the fields that, when tested, would indicate a duplicate record. This solution shows only a single loop with three dictionaries. If you wanted a list of all duplicate records found, then rewrite the code to create all three dictionaries in a single loop, then use a separate (not nested) loop to each record against each dictionary key and keep a running list of the dupes. (I kept it to a single loop for efficiency.)

Creating a single dictionary with a longer key (e.g. lastName+firstName+address+email) will cause you to have key collisions for records with duplicate ALL those fields, but you'll still have to find a way to test the other combinations. Someone far smarter than me may come up with a simpler way.

Option Explicit

Sub test()
    MarkDuplicates ActiveSheet, 1, 2, 3, 4, 5, 6
End Sub

Sub MarkDuplicates(sh As Worksheet, lNameCol As Long, _
                   fNameCol As Long, addressCol As Long, _
                   emailCol As Long, duplicateCol As Long, _
                   fOccuranceCol As Long)
    Dim lastRow As Long
    Dim lastCol As Long
    Dim acctRange As Range
    Dim acctData As Variant
    Dim acctDict1 As Dictionary
    Dim acctDict2 As Dictionary
    Dim acctDict3 As Dictionary
    Dim acctKey As String
    Dim checkRow As Long
    Dim otherRow As Long
    Dim dupScore As Integer
    Dim dupList As String

    '--- determine the range of data and copy to a memory-based array
    lastRow = sh.Cells(sh.Rows.Count, "A").End(xlUp).Row
    lastCol = sh.Cells(1, sh.Columns.Count).End(xlToLeft).Column
    Set acctRange = sh.Range("A1").Resize(lastRow, lastCol)
    acctData = acctRange

    Set acctDict1 = New Dictionary
    Set acctDict2 = New Dictionary
    Set acctDict3 = New Dictionary

    '--- build the initial dictionary
    '    for the key to trip as duplicate, there are three possible
    '    combinations to check, so we make three dictionaries and
    '    create keys as combinations of the fields
    For checkRow = 2 To lastRow
        '--- clear previous flags
        acctData(checkRow, duplicateCol) = ""
        acctData(checkRow, fOccuranceCol) = ""

        '--- dupe is lastname + firstname
        acctKey = acctData(checkRow, lNameCol) & acctData(checkRow, fNameCol)
        If Not acctDict1.Exists(acctKey) Then
            acctDict1.Add acctKey, checkRow
        ElseIf acctData(checkRow, duplicateCol) <> "Yes" Then
            acctData(checkRow, duplicateCol) = "Yes1"
            acctData(checkRow, fOccuranceCol) = acctDict1.Item(acctKey)
        End If

        '--- dupe is lastname + address + email
        acctKey = acctData(checkRow, lNameCol) & acctData(checkRow, addressCol) & _
                  acctData(checkRow, emailCol)
        If Not acctDict2.Exists(acctKey) Then
            acctDict2.Add acctKey, checkRow
        ElseIf acctData(checkRow, duplicateCol) <> "Yes" Then
            acctData(checkRow, duplicateCol) = "Yes2"
            acctData(checkRow, fOccuranceCol) = acctDict2.Item(acctKey)
        End If

        '--- dupe is firstname + address + email
        acctKey = acctData(checkRow, fNameCol) & acctData(checkRow, addressCol) & _
                  acctData(checkRow, emailCol)
        If Not acctDict3.Exists(acctKey) Then
            acctDict3.Add acctKey, checkRow
        ElseIf acctData(checkRow, duplicateCol) <> "Yes" Then
            acctData(checkRow, duplicateCol) = "Yes3"
            acctData(checkRow, fOccuranceCol) = acctDict3.Item(acctKey)
        End If
    Next checkRow

    '--- copy the array back to the worksheet
    acctRange = acctData

    Set sh = Nothing
End Sub

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related

From Dev

Delete sheet contents but not macro button Excel Vba

From Dev

Optimizing VBA macro

From Dev

Optimizing VBA macro

From Dev

Excel VBA macro color code

From Dev

How to find a row in an Excel sheet using Excel Macro VBA?

From Dev

Excel VBA finding matching cell in another sheet in same book

From Dev

How to I change this Excel macro (VBA) to run on any sheet?

From Dev

Excel loop VBA Macro copy cells to new sheet

From Dev

How to restrict excel-vba macro to one sheet?

From Dev

Excel VBA, Code name appearing with sheet name

From Dev

Excel vba add code to sheet module programmatically

From Dev

VBA to put code programmatically in sheet-level macro

From Dev

Can using Option Explicit speed up a large macro in Excel VBA?

From Dev

Excel Large Function with Duplicates

From Dev

Excel VBA macro Script : Find value from Sheet 1 in Sheet 2 and Copy Value found

From Dev

Excel VBA - Creating a macro that inserts the active sheet name inside another sheet cell's forumula

From Dev

VBA Macro Works on sheet 1 but not on Sheet 2

From Dev

Finding the last cell in an excel sheet?

From Dev

Trying to write a an Excel macro to find a large text string and copy to another sheet

From Dev

Finding duplicates between 2 large tables

From Dev

Excel VBA - Removing duplicates

From Dev

Excel VBA Code to subtract the quantity from another sheet

From Dev

Excel - VBA code to add a column ID and then combine all sheets into a sheet

From Dev

Excel VBA - Error in code to copy to specific sheet, dependent on range

From Dev

Excel VBA: how can I restrict code execution to the sheet

From Dev

Excel VBA code to populate summary sheet with formula linking cells

From Dev

Excel not finding VBA Function

From Dev

Excel VBA Macro Code Inserting Unlimited Rows Without Stopping

From Dev

Add formula expression instead of values in VBA Macro excel code

Related Related

  1. 1

    Delete sheet contents but not macro button Excel Vba

  2. 2

    Optimizing VBA macro

  3. 3

    Optimizing VBA macro

  4. 4

    Excel VBA macro color code

  5. 5

    How to find a row in an Excel sheet using Excel Macro VBA?

  6. 6

    Excel VBA finding matching cell in another sheet in same book

  7. 7

    How to I change this Excel macro (VBA) to run on any sheet?

  8. 8

    Excel loop VBA Macro copy cells to new sheet

  9. 9

    How to restrict excel-vba macro to one sheet?

  10. 10

    Excel VBA, Code name appearing with sheet name

  11. 11

    Excel vba add code to sheet module programmatically

  12. 12

    VBA to put code programmatically in sheet-level macro

  13. 13

    Can using Option Explicit speed up a large macro in Excel VBA?

  14. 14

    Excel Large Function with Duplicates

  15. 15

    Excel VBA macro Script : Find value from Sheet 1 in Sheet 2 and Copy Value found

  16. 16

    Excel VBA - Creating a macro that inserts the active sheet name inside another sheet cell's forumula

  17. 17

    VBA Macro Works on sheet 1 but not on Sheet 2

  18. 18

    Finding the last cell in an excel sheet?

  19. 19

    Trying to write a an Excel macro to find a large text string and copy to another sheet

  20. 20

    Finding duplicates between 2 large tables

  21. 21

    Excel VBA - Removing duplicates

  22. 22

    Excel VBA Code to subtract the quantity from another sheet

  23. 23

    Excel - VBA code to add a column ID and then combine all sheets into a sheet

  24. 24

    Excel VBA - Error in code to copy to specific sheet, dependent on range

  25. 25

    Excel VBA: how can I restrict code execution to the sheet

  26. 26

    Excel VBA code to populate summary sheet with formula linking cells

  27. 27

    Excel not finding VBA Function

  28. 28

    Excel VBA Macro Code Inserting Unlimited Rows Without Stopping

  29. 29

    Add formula expression instead of values in VBA Macro excel code

HotTag

Archive