PDA

View Full Version : Remove Duplicates From Across Multipe Columns In A Single WorkSheet



cyphrevil
10-23-2013, 11:55 PM
Hi all,

I am in need of help, that's the short story.

I have almost 9 million cells with values in them and they occupy 17 columns in total (one sheet). Some of them take almost 1.04 mln rows, some 70k.

All I have to do is to remove duplicate values. I've tried advanced copy/paste/only-unique-values feature on excel but even though I'm working on nice Dell i7 8gb ram etc., I have found excel crashed in the morning.

Any help including vba will be very appreciated.

Many thanks in advance
T.

Excel Fox
10-24-2013, 09:27 AM
Do you want the duplicates in each column removed, or should the data in the entire sheet be unique? If it's for each column, try this


Sub ExcelFox()

Dim lng As Long
Dim wks As Worksheet

Set wks = Worksheets("NameOfSheetWithDuplicateValues")
For lng = 1 To 17
With wks
.Range(.Cells(1, lng), .Cells(.Rows.Count, lng).End(xlUp)).RemoveDuplicates Columns:=1, Header:=xlNo
End With
Next lng

End Sub


Note that this will only work for Excel 2007 and above.

cyphrevil
10-24-2013, 11:39 AM
They are already unique in each column, need to remove duplicates between columns now..

Thanks anyway.

Excel Fox
10-24-2013, 11:47 PM
Try this



Sub ExcelFox()

Dim lng As Long
Dim wks As Worksheet
Dim objDic As Object
Dim var As Variant
Dim varIndex As Variant
Dim lngRow As Long
Const clngSteps As Long = 100000

Set objDic = CreateObject("Scripting.Dictionary")
Set wks = Worksheets("NameOfSheetWithDuplicateValues")
For lng = 1 To 17
With wks
.Range(.Cells(1, lng), .Cells(.Rows.Count, lng).End(xlUp)).RemoveDuplicates Columns:=1, Header:=xlNo
var = .Range(.Cells(1, lng), .Cells(.Rows.Count, lng).End(xlUp)).Value2
.Range(.Cells(1, lng), .Cells(.Rows.Count, lng).End(xlUp)).Clear
For lngRow = 1 To UBound(var)
objDic.Item(var(lngRow, 1)) = 0
Next lngRow
End With
Next lng
Erase var
var = objDic.keys
Set objDic = Nothing
If wks.Parent.FullName <> wks.Parent.Name Then
wks.Parent.Save
End If
lng = 1
For lngRow = 1 To UBound(var) + 1 Step clngSteps
varIndex = Evaluate(clngSteps * (lng - 1) & "+ ROW(1:" & Application.Min(UBound(var) + 1, clngSteps) & ")")
Cells(1, lng).Resize(Application.Min(UBound(var) + 1, clngSteps)).Value = Application.Index(var, varIndex)
lng = lng + 1
Next lngRow
Erase varIndex

End Sub

cyphrevil
10-25-2013, 04:33 PM
Thank you so much for your effort. The script ran for 2 hours then crashed at below line:

Cells(1, lng).Resize(Application.Min(UBound(var) + 1, clngSteps)).Value = Application.Index(var, varIndex)

Error message said 'mismatch error'

also, the sheet has been cleared of all data - if it's of any help to you.

Kind Regards

Excel Fox
10-25-2013, 09:03 PM
What's your formula separator? Is it : or something else?

Also, can you test this on a much smaller sample.. maybe 10000 rows and 17 columns.... If it works, then I'll come up with something else. If it doesn't work, then something's wrong with the data.

Admin
10-26-2013, 12:47 AM
Hi

Here is a different method.


Option Explicit

Sub kTest()

Dim strAll As String
Dim i As Long
Dim r As Long
Dim d As Long
Dim fd As String
Dim fn As String
Dim objFS As Object
Dim objFile As Object
Dim adoConn As Object
Dim adoRset As Object

Const Block = 32000

strAll = "Temp"

For i = 1 To 5
d = 0
For r = 1 To 100000 Step Block
strAll = strAll & vbCrLf & Join(Application.Transpose(Cells(r, i).Resize(Application.Min(Block, Abs(100000 - d))).Value2), vbCrLf)
d = d + Block
Next
Next

fd = Environ("temp") & "\"
fn = "Test.txt"

Set objFS = CreateObject("scripting.filesystemobject")
Set objFile = objFS.opentextfile(fd & fn, 2, 1)

objFile.write strAll
objFile.Close

Set adoConn = CreateObject("ADODB.Connection")

adoConn.Open "Driver={Microsoft Text Driver (*.txt; *.csv)};" & _
"Dbq=" & fd & ";Extensions=txt;"

Set adoRset = CreateObject("ADODB.Recordset")
adoRset.Open "SELECT [Temp] FROM [" & fn & "] GROUP BY [Temp]", adoConn, 3, 1, 1

d = adoRset.RecordCount

ActiveSheet.UsedRange.ClearContents

If d > Rows.Count Then
i = 1
While Not adoRset.EOF
Cells(1, i).CopyFromRecordset adoRset, 1000000
i = i + 1
Wend
Else
Range("a1").CopyFromRecordset adoRset
End If

adoRset.Close
adoConn.Close

Kill fd & fn

End Sub

cyphrevil
10-26-2013, 06:46 PM
What's your formula separator? Is it : or something else?

Also, can you test this on a much smaller sample.. maybe 10000 rows and 17 columns.... If it works, then I'll come up with something else. If it doesn't work, then something's wrong with the data.


On 10000 rows it removed all the content and then displayed the same error in the same line.

My formula separator seems to be , and not :

cyphrevil
10-26-2013, 06:52 PM
Hi,

This is regarding 'different method':

Tried the code, worked for about 20-30 seconds and left me with only column A populated to around 440k rows. Not just unique values cause I can't find some entries from the full 17 columns table.

Thank you for all your help as I still can't figure this out on my own :(

Admin
10-26-2013, 07:13 PM
Hi,

This is regarding 'different method':

Tried the code, worked for about 20-30 seconds and left me with only column A populated to around 440k rows. Not just unique values cause I can't find some entries from the full 17 columns table.

Thank you for all your help as I still can't figure this out on my own :(

My bad :(

I tested this code with 5 columns of data.

replace
For i = 1 To 5 with
For i = 1 To 17

cyphrevil
10-26-2013, 07:49 PM
Resulted with column A with 1mln rows and B with 143k rows. Ends with values starting with 'z' so seems to be working, yet I still can't find certain values from main table.

I'm sure you'll have some ideas.

Thank you.

Excel Fox
10-26-2013, 08:42 PM
I just noticed that my code isn't working when the length of text in any cell is greater than 255 characters. Does your data have any such values?

Admin
10-26-2013, 08:49 PM
Any particular type of values? Any leading/trailing space on values ?

cyphrevil
10-26-2013, 09:04 PM
Any particular type of values? Any leading/trailing space on values ?

Hi guys, to both of you:

The data are email addresses, this is an outcome of my routers logs. There should be nothing even remotely close to 255 characters length.
Thanks again.

Excel Fox
10-26-2013, 09:36 PM
I modified the code slightly, but not too different. It's working on 500 rows and 17 columns. Can you try this.

And here's the code.




Sub ExcelFox()

Dim lng As Long
Dim wks As Worksheet
Dim objDic As Object
Dim var As Variant
Dim varIndex As Variant
Dim lngRow As Long
Const clngLastColumn As Long = 17
Const clngSteps As Long = 100000

Set objDic = CreateObject("Scripting.Dictionary")
Set wks = Worksheets("NameOfSheetWithDuplicateValues")
For lng = 1 To clngLastColumn
With wks
.Range(.Cells(1, lng), .Cells(.Rows.Count, lng).End(xlUp)).RemoveDuplicates Columns:=1, Header:=xlNo
var = .Range(.Cells(1, lng), .Cells(.Rows.Count, lng).End(xlUp)).Value2
For lngRow = 1 To UBound(var)
objDic.Item(var(lngRow, 1)) = 0
Next lngRow
End With
Next lng
Erase var
var = objDic.keys
var = Application.Transpose(Application.Transpose(var))
Set objDic = Nothing
If wks.Parent.FullName <> wks.Parent.Name Then
wks.Parent.Save
End If
lng = 1
For lngRow = 1 To UBound(var) + Abs(LBound(var) = 0) Step clngSteps
varIndex = Application.Transpose(Evaluate(clngSteps * (lng - 1) & "+ ROW(1:" & Application.Min(UBound(var) + Abs(LBound(var) = 0), clngSteps) & ")"))
Cells(1, clngLastColumn + lng).Resize(Application.Min(UBound(var) + Abs(LBound(var) = 0), clngSteps)).Value2 = Application.Transpose(Application.Index(var, varIndex))
lng = lng + 1
Next lngRow
Erase varIndex

End Sub

cyphrevil
10-26-2013, 10:42 PM
I modified the code slightly, but not too different. It's working on 500 rows and 17 columns. Can you try this.

And here's the code.




Sub ExcelFox()

Dim lng As Long
...




It's grinding it, I'll let you know as soon as it's finished.

Admin
10-26-2013, 11:18 PM
Resulted with column A with 1mln rows and B with 143k rows. Ends with values starting with 'z' so seems to be working, yet I still can't find certain values from main table.

I'm sure you'll have some ideas.

Thank you.

Again I missed the last row part. In my test I hard coded last row as 100000. Give this a try.


Option Explicit

Sub kTest()

Dim strAll As String
Dim i As Long
Dim r As Long
Dim n As Long
Dim d As Long
Dim fd As String
Dim fn As String
Dim objFS As Object
Dim objFile As Object
Dim adoConn As Object
Dim adoRset As Object

Const Block = 65000

strAll = "Temp"

For i = 1 To 17
d = 0
n = Cells(Rows.Count, i).End(3).Row
If n = 1 Then n = Rows.Count
For r = 1 To n Step Block
strAll = strAll & vbCrLf & Join(Application.Transpose(Cells(r, i).Resize(Application.Min(Block, Abs(n - d))).Value2), vbCrLf)
d = d + Block
Next
Next

fd = Environ("temp") & "\"
fn = "Test.txt"

Set objFS = CreateObject("scripting.filesystemobject")
Set objFile = objFS.opentextfile(fd & fn, 2, 1)

objFile.write strAll
objFile.Close

Set adoConn = CreateObject("ADODB.Connection")

adoConn.Open "Driver={Microsoft Text Driver (*.txt; *.csv)};" & _
"Dbq=" & fd & ";Extensions=txt;"

Set adoRset = CreateObject("ADODB.Recordset")
adoRset.Open "SELECT [Temp] FROM [" & fn & "] GROUP BY [Temp]", adoConn, 3, 1, 1

d = adoRset.RecordCount

ActiveSheet.UsedRange.ClearContents

If d > Rows.Count Then
i = 1
While Not adoRset.EOF
Cells(1, i).CopyFromRecordset adoRset, 1000000
i = i + 1
Wend
Else
Range("a1").CopyFromRecordset adoRset
End If

adoRset.Close
adoConn.Close

Kill fd & fn

End Sub

cyphrevil
10-27-2013, 12:55 AM
The previous one just ended with 'out of memory' message.

I'm going to reboot the machine and try the new code. Will update asap.

cyphrevil
10-27-2013, 02:29 AM
The previous one just ended with 'out of memory' message.

I'm going to reboot the machine and try the new code. Will update asap.


it's giving me error number 14 'out of string' message.

Admin
10-27-2013, 10:48 AM
Hi

Please DO NOT quote the entire post.

Try this one.


Option Explicit

Sub kTest()

Dim strAll As String
Dim i As Long
Dim r As Long
Dim n As Long
Dim d As Long
Dim fd As String
Dim fn As String
Dim objFS As Object
Dim objFile As Object
Dim adoConn As Object
Dim adoRset As Object
Dim Flg As Boolean

Const Block = 65000

fd = Environ("temp") & "\"
fn = "Test.txt"

Set objFS = CreateObject("scripting.filesystemobject")

strAll = "Temp"

For i = 1 To 17
d = 0
n = Cells(Rows.Count, i).End(3).Row
If n = 1 Then n = Rows.Count
For r = 1 To n Step Block
strAll = strAll & vbCrLf & Join(Application.Transpose(Cells(r, i).Resize(Application.Min(Block, Abs(n - d))).Value2), vbCrLf)
d = d + Block
Next
If i Mod 3 = 0 Then
If Flg Then
Set objFile = objFS.opentextfile(fd & fn, 8)
Else
Set objFile = objFS.opentextfile(fd & fn, 2, 0)
Flg = True
End If
objFile.write strAll
objFile.Close
strAll = vbNullString
End If
Next
If LenB(strAll) Then
Set objFile = objFS.opentextfile(fd & fn, IIf(Flg, 8, 2), IIf(Flg, 0, 1))
objFile.write strAll
objFile.Close
End If

Set adoConn = CreateObject("ADODB.Connection")

adoConn.Open "Driver={Microsoft Text Driver (*.txt; *.csv)};" & _
"Dbq=" & fd & ";Extensions=txt;"

Set adoRset = CreateObject("ADODB.Recordset")
adoRset.Open "SELECT [Temp] FROM [" & fn & "] GROUP BY [Temp]", adoConn, 3, 1, 1

d = adoRset.RecordCount

ActiveSheet.UsedRange.ClearContents

If d > Rows.Count Then
i = 1
While Not adoRset.EOF
Cells(1, i).CopyFromRecordset adoRset, 1000000
i = i + 1
Wend
Else
Range("a1").CopyFromRecordset adoRset
End If

adoRset.Close
adoConn.Close

Kill fd & fn

End Sub

snb
10-27-2013, 09:24 PM
assumptions:

- data in sheet1
- row 1 doesn't contain fieldnames
- no empty rows in the usedrange
- 17 columns that contain data


Sub M_snb()
sn=sheet1.cells(1).currentregion.resize(,17)

with createobject("scripting.dictionary")
for j=1 to ubound(sn)
.item(join(application.index(sn,j,0)))=application .index(sn,j,0)
next
sheet2.cells(1).resize(.count,17)=application.inde x(.items,0,0)
end with
End Sub

or another method


Sub M_snb()
sn=sheet1.cells(1).currentregion.resize(,17)
c00=""

for j=1 to ubound(sn)
if instr(c00 & "|" ,"|" & application.index(sn,j,0) & "|") then
sn(j,1)=""
else
c00=c00 & "|" & application.index(sn,j,0)
end if
next

sheet1.cells(1).currentregion.resize(,17)=sn
sheet1.columns(1).specialcells(4).entirerow.delete
End Sub

cyphrevil
10-29-2013, 06:34 PM
Hi, only got another chance to get on it.

Unfortunately, same run-time error 14, out of string space.

cyphrevil
10-29-2013, 06:46 PM
Hi again,

In this case both methods ended up with error code 13, type mismatch.

Admin
10-29-2013, 07:06 PM
Hi, only got another chance to get on it.

Unfortunately, same run-time error 14, out of string space.

Give this a try.


Option Explicit

Sub kTest()

Dim strAll As String
Dim i As Long
Dim r As Long
Dim n As Long
Dim d As Long
Dim fd As String
Dim fn As String
Dim objFS As Object
Dim objFile As Object
Dim adoConn As Object
Dim adoRset As Object
Dim Flg As Boolean

Const Block = 65000

fd = Environ("temp") & "\"
fn = "Test.txt"

Set objFS = CreateObject("scripting.filesystemobject")

strAll = "Temp"

For i = 1 To 17
d = 0
n = Cells(Rows.Count, i).End(3).Row
If n = 1 Then n = Rows.Count
For r = 1 To n Step Block
strAll = strAll & vbCrLf & Join(Application.Transpose(Cells(r, i).Resize(Application.Min(Block, Abs(n - d))).Value2), vbCrLf)
d = d + Block
Next
If Flg Then
Set objFile = objFS.opentextfile(fd & fn, 8)
Else
Set objFile = objFS.opentextfile(fd & fn, 2, 0)
Flg = True
End If
objFile.write strAll
objFile.Close
strAll = vbNullString
Next
If LenB(strAll) Then
Set objFile = objFS.opentextfile(fd & fn, IIf(Flg, 8, 2), IIf(Flg, 0, 1))
objFile.write strAll
objFile.Close
End If

Set adoConn = CreateObject("ADODB.Connection")

adoConn.Open "Driver={Microsoft Text Driver (*.txt; *.csv)};" & _
"Dbq=" & fd & ";Extensions=txt;"

Set adoRset = CreateObject("ADODB.Recordset")
adoRset.Open "SELECT [Temp] FROM [" & fn & "] GROUP BY [Temp]", adoConn, 3, 1, 1

d = adoRset.RecordCount

ActiveSheet.UsedRange.ClearContents

If d > Rows.Count Then
i = 1
While Not adoRset.EOF
Cells(1, i).CopyFromRecordset adoRset, 1000000
i = i + 1
Wend
Else
Range("a1").CopyFromRecordset adoRset
End If

adoRset.Close
adoConn.Close

Kill fd & fn

End Sub

cyphrevil
10-29-2013, 07:26 PM
No matter what I do it keeps giving me 'file not found' error code 53.

Admin
10-29-2013, 08:07 PM
Hi

Replace this line
Set objFile = objFS.opentextfile(fd & fn, 2, 0)

with


Set objFile = objFS.opentextfile(fd & fn, 2, 1)

cyphrevil
10-29-2013, 11:04 PM
YES, IT WORKS!!

Runs for about 5 minutes and results with nearly 2.8mln unique cells. Had to run it few times to check if every result is identical.
There can be no empty cells in any of the columns though - I mean until last cell used in the column.

Thank you thank you thank you

Admin
10-30-2013, 08:19 AM
You are welcome ! :cheers: