The above table shows that the company names differ not only in the number of characters but also in capitalization. However, there is a general pattern of similarity in the first few characters, reading from left to right. Given that, we can split the problem into two parts.

The first part is to extract the first few characters that are similar in both datasets and merge the data using those characters. For example, in the second row of the above table, the “The Pakistan General Insurance” part is similar in both tables. If we count these characters, there are 30. So how exactly are we going to find the matching number of characters in each case? We shall not do that. Instead, we shall take the iterative path where:

1. We start by extracting the first n characters from the key variables in both datasets and merge using the extracted (truncated) variables. If the merge succeeds, we shall save the merged data separately. We shall also delete the successfully merged records from the initial file and further process those that did not merge. In this first step, we shall normally start by extracting a large number of characters, for example, up to 30 characters in the case of row 2 of the above table. Please note that the relevant Stata function for extracting a given number of characters from a variable is substr(). We shall discuss it further as we proceed in this article.

2. In the next iteration, we shall further process those records which did not merge. This time we shall extract one character fewer than in the preceding step. The idea is that if two variables did not have the first 30 characters in common, they might have the first 29 characters in common. So we shall proceed to merge using the first 29 characters of the two variables. But this time we need to be careful: as we reduce the number of initial characters extracted, the chances of matching incorrect records increase. For example, consider the two records THE BANK OF PUNJAB and THE BANK OF KHYBER, where the first 12 characters, i.e. “THE BANK OF “, are exactly the same in both records. If we merge the two datasets using only the first 12 characters, we shall incorrectly merge THE BANK OF KHYBER into THE BANK OF PUNJAB.

3. Again, we shall retain the successfully merged records, append them to the already saved merged data, and delete them from the initial dataset, just as we did in the first step above.

4. The iterative process will continue in the same fashion, and each time we need to pay more attention to the merged data to identify any incorrect merges, as explained in step 2. We shall stop at a point where the data seems to have a reasonable number of similar observations based on the initial characters.

The second part of the problem is the dissimilarity in the capitalization of the names, for example NATIONAL BANK OF PAKISTAN versus national bank of pakistan. This part of the problem is easy to handle, as we can use the function lower() to convert all the names to lowercase and then use the result in the merging process.

Let’s use a simple example to implement what we have read so far. In this example, we are going to use the same data as given in the above table. From that data, we shall create two datasets. One of the datasets will remain in Stata's memory; we shall call it data_memory. The other dataset will be saved to a file; we shall call it data_file. The data_file has two variables, name and symbol. We shall merge data_memory into data_file using the variable name as the merging criterion. To create the two datasets, we can copy and paste the following code into the Stata do-file editor and run it.

* Copy the following code and run from Stata do editor

Now that we have created the two datasets, let’s start the process step by step. In the following code box, the code performs the first step as discussed above. I shall then explain what the code actually does.

First, you have a variable Tsymbol in dataone, but the nearest equivalent in datatwo is tysbol. For your merge to work as intended, the match variables must be identical in both data sets. (And variable names in Stata are case sensitive; the merge will only work if they agree exactly.) Second, why is Tsymbol 8 characters in dataone but only five characters in datatwo? If the variable in dataone is padded with blanks, and the one in datatwo is not, then the merge will not work as you expect. Probably you need to do something like replace Tsymbol = trim(itrim(Tsymbol)) in dataone.
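The iterative truncate-and-merge procedure described in this article can be sketched in Stata roughly as follows. This is only a minimal illustration under stated assumptions, not the author's original code: the sample rows, the placeholder symbols SYM1 to SYM4, the helper variables key, name_file and matched_on, the tempfile names, and the stopping length of 18 are all hypothetical. The key is built with lower() and substr() as discussed in the article, and trim(itrim()) is applied first so that padded blanks do not break the match.

```stata
* Run the whole block at once from the do-file editor (it relies on local macros).
clear all
tempfile left_file left_memory matched

* data_file: name and symbol (symbols are hypothetical placeholders)
input str60 name str8 symbol
"THE PAKISTAN GENERAL INSURANCE COMPANY LIMITED" "SYM1"
"NATIONAL BANK OF PAKISTAN" "SYM2"
"THE BANK OF PUNJAB" "SYM3"
"THE BANK OF KHYBER" "SYM4"
end
rename name name_file          // keep both spellings visible after the merge
save `left_file'

* data_memory: the same companies, spelled differently (illustrative rows)
clear
input str60 name
"The Pakistan General Insurance Co. Ltd."
"national bank of pakistan"
"The Bank of Punjab Limited"
"The Bank of Khyber Limited"
end
save `left_memory'

local first = 1
forvalues n = 30(-1)18 {
    * Build the truncated, lowercase key in the still-unmatched file records
    use `left_file', clear
    if _N == 0 {
        continue, break
    }
    generate str60 key = lower(substr(trim(itrim(name_file)), 1, `n'))
    save `left_file', replace

    * Build the same key in the still-unmatched memory records and merge
    use `left_memory', clear
    if _N == 0 {
        continue, break
    }
    generate str60 key = lower(substr(trim(itrim(name)), 1, `n'))
    * merge 1:1 stops with an error if a truncated key is no longer unique,
    * which is itself a sign that the prefix has become too short
    merge 1:1 key using `left_file'

    * Save the successful matches, appending to earlier ones
    count if _merge == 3
    if r(N) > 0 {
        preserve
        keep if _merge == 3
        generate matched_on = `n'      // prefix length that produced the match
        drop key _merge
        if `first' == 0 {
            append using `matched'
        }
        save `matched', replace
        local first = 0
        restore
    }

    * Keep the unmatched records from each side for the next iteration
    preserve
    keep if _merge == 2
    keep name_file symbol
    save `left_file', replace emptyok
    restore
    keep if _merge == 1
    keep name
    save `left_memory', replace emptyok
}

* Inspect the accumulated matches; short prefixes deserve a manual check
use `matched', clear
list name name_file symbol matched_on, noobs
```

Recording the prefix length in matched_on makes it easier to review the matches made on short prefixes, which are the ones most likely to be wrong, as in the THE BANK OF PUNJAB and THE BANK OF KHYBER example.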