Intel Go Parallel Weblog

Detecting Scalability Problems With Intel Parallel Universe Portal (2009-11-23)
ghillar (gastonhillar@hotmail.com), Parallel Tasking

You already know that achieving a linear speedup as the number of cores increases is very difficult in real-life parallelized applications. However, sometimes the multicore scalability of certain algorithms on existing multicore systems is worse than expected. The overhead and the bugs introduced by concurrency can bring really unexpected scalability problems when the number of cores increases. Intel can help you with a free service in the cloud.

You don't have access to a system with an Intel Xeon X5560 CPU running at a maximum clock speed of 2.80 GHz and offering 8 Hyperthreaded physical cores (16 logical cores, 16 hardware threads). However, you want to test the multicore scalability of an algorithm as the number of cores increases, up to 16 logical cores. If your application is written in C++, you can compile it as a 32-bit Windows executable, and the total time needed to run it using 1, 2, 4, 8 and 16 logical cores is less than 1 minute, you can use the new scalability service offered by Intel Parallel Universe Portal. If you already own a system with 16 logical cores, you can skip this post.

It is very easy to use this new service. You just have to follow a few steps and you'll be able to check the multicore scalability of your application up to 16 logical cores. I'm going to explain how to use this service step by step.

Create an application that runs without user interaction. Make sure it tries to use as many logical cores as the underlying hardware offers. This way, you'll be able to test its multicore scalability. Once you obtain the first results, you can make further changes to test other specific scenarios.
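As a minimal sketch of what "use as many logical cores as available" can look like, the following assumes a compiler with standard C++11 threads. The names `worker_count` and `parallel_sum` are hypothetical, and the summing workload only stands in for your real algorithm; OpenMP or another threading library would serve equally well.

```cpp
#include <numeric>
#include <thread>
#include <vector>

// One worker per logical core reported by the runtime. The standard allows
// hardware_concurrency() to return 0 when the value is unknown, so fall back to 1.
unsigned worker_count() {
    unsigned n = std::thread::hardware_concurrency();
    return n == 0 ? 1 : n;
}

// Hypothetical workload: sum the integers in [0, total). Each worker gets its
// own contiguous slice and its own output slot, so no locking is needed.
unsigned long long parallel_sum(unsigned long long total, unsigned workers) {
    if (workers == 0) workers = 1;
    std::vector<unsigned long long> partial(workers, 0);
    std::vector<std::thread> pool;
    const unsigned long long chunk = total / workers;
    for (unsigned w = 0; w < workers; ++w) {
        const unsigned long long begin = w * chunk;
        // The last worker also takes the remainder of the range.
        const unsigned long long end = (w + 1 == workers) ? total : begin + chunk;
        pool.emplace_back([&partial, w, begin, end] {
            unsigned long long s = 0;
            for (unsigned long long i = begin; i < end; ++i) s += i;
            partial[w] = s;
        });
    }
    for (auto& t : pool) t.join();
    return std::accumulate(partial.begin(), partial.end(), 0ULL);
}
```

A `main` that calls `parallel_sum(some_size, worker_count())` and prints the result needs no input from the user, which is what the service requires.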

If the application needs additional files or DLLs, copy all the necessary files to a new folder and create a ZIP file including the EXE file and all these additional files. For example, if the application is myparallelapp.exe and it needs the libiomp5md.dll DLL, your ZIP file should include both files. So far, the service doesn't support .NET applications; they must be unmanaged Windows C++ apps.

Enter Intel Parallel Universe Portal, log in with your account and click on Start Here.

Enter a name to identify your session. Click on the Upload button, choose the previously created ZIP file and the file will begin uploading, as in Figure 1:

Figure 1: Uploading the ZIP file with the application to run and all the necessary additional files.

Click on the Next button. It is time to configure your job. You have to select the command line to run, the executable file and the arguments, as in Figure 2:

Figure 2: Configuring the job to run.

You can leave the arguments textbox blank if the application doesn't need arguments. Click on the Next button. The Web site will offer a summary with the session name, the uploaded zip file name and the command line to run. You have to check whether everything is specified as expected, as in Figure 3:

Figure 3: Checking the job's summary.

If everything is okay, click on the Submit button. The job will then enter a queue. The Web page will show your job and refresh the information about it. However, you can also wait for an e-mail with information about your job. If something goes wrong, you'll receive an e-mail with a link to a Web page with more information about the errors. The most common problem is that the application takes more than 1 minute to run with all the different configurations. As mentioned above, there's a 1 minute time limit. The other usual problem is an unsatisfied dependency, like a missing DLL that the application needs to run. Remember to include everything your app needs in the ZIP file. The error report usually includes the names of the unsatisfied dependencies.

If the application runs without problems using 1, 2, 4, 8 and 16 logical cores in less than 1 minute, you'll receive an e-mail with a link to a Web page with an easy-to-understand scalability report. The report offers both tables and graphs with the following information for each number of logical cores used:

  • The elapsed time in seconds.
  • The average concurrency.

The following picture shows part of the report created by a job with easy-to-detect multicore scalability problems. As you can see, the application takes less time to run when it moves from 1 to 2 logical cores. However, when it runs with 4, 8 and 16 logical cores, there's no speedup. In fact, when it tries to take advantage of 16 logical cores, its performance is really bad.

Figure 4: Tables and graphs with the elapsed time in seconds and the average concurrency for an application with multicore scalability problems.

The application has a serious multicore scalability problem. It cannot achieve significant speedups when the number of cores is greater than 2. Besides, working with 16 logical cores introduces so much overhead that the parallelized algorithms take more time than is needed to complete the job with just one logical core. Believe me, this situation is more common than you might expect. Redesigning serial algorithms in order to take advantage of multicore is a complex task. Redesigning them so that they scale beyond 4 cores is an even more complex task. The manycore challenge is around the corner.

Besides, the report offers two additional graphs with the following information for the different number of logical cores used:

  • The performance improvement, considering the time taken to complete the job with the different number of logical cores.
  • The throughput scalability, explaining how the throughput scales as the number of logical cores increases.

Figure 5 shows the two graphs generated by the same job with multicore scalability problems. As you can see, the application increases both performance and throughput when it moves from 1 to 2 logical cores. However, when it runs with 4, 8 and 16 logical cores, throughput scalability decreases. The performance improvement graph shows two lines: a black line with the actual time needed to run the job with each number of logical cores and a red line with the optimal time. In this case, the actual time needed to run the application with 16 logical cores is around 600 times the optimal time.

Figure 5: Graphs with the performance improvement and the throughput scalability for an application with multicore scalability problems.
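The figures in such a report are simple to reproduce from your own timings. The sketch below assumes that "optimal time" means perfect linear scaling, that is, the single-core time divided by the number of logical cores; the post does not say how the portal computes it. The function names are hypothetical.

```cpp
// Speedup relative to the single-core run: T(1) / T(n).
double speedup(double t1_seconds, double tn_seconds) {
    return t1_seconds / tn_seconds;
}

// Optimal time on n logical cores under the linear-scaling assumption: T(1) / n.
double optimal_time(double t1_seconds, unsigned cores) {
    return t1_seconds / cores;
}

// How many times slower the measured run is than the optimal one.
double slowdown_vs_optimal(double t1_seconds, double tn_seconds, unsigned cores) {
    return tn_seconds / optimal_time(t1_seconds, cores);
}
```

With made-up timings of 1.0 second on 1 core and 37.5 seconds on 16 cores, `slowdown_vs_optimal(1.0, 37.5, 16)` gives 600, the same order of gap as the one described for Figure 5. These numbers are illustrative only and are not taken from the report.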

As you can see, the reports are very easy to understand. This application is not able to scale beyond 2 logical cores.

For more information, you can also check the Intel Parallel Universe Portal Forums.

You can also read the very informative post "Beware of Those That Claim Linear Performance Increases" by Markus Levy.

If you want to use virtualization to test multicore scalability, you can read my previous post, Using VirtualBox 3.0 Virtualization Software to Measure Multicore Scalability.

Just Say No To SFAQL Parallelism (2009-11-17)

chughes (ctestlabs@ctestlabs.org), Multicore Moments

I know, I know, a lot of folks out there are big subscribers to the 'just-get-er-done' school of software maintenance and development. The idea of sitting somewhere while a design group is doing its work is just plain torture. It feels like a waste of time and money. Somehow there's always a fire that demands that we code now and capture the design later.

I'm sure the valiant souls that came up with agile software development had very good intentions and are probably supported by some pretty reasonable theory, but if they could only see how twisted and perverse agile development can become in practice, especially in software that is destined for parallel and multicore processor environments, they might have refrained. Ship now, patch soon to follow. We need it in production today! Send the customer an e-mail explaining a workaround for the defect. We can't wait two weeks; we've got to get it out on the web now. Just post the browser requirements. If they want access to our site they'll upgrade!

The complexity of parallel programming doesn't seem to deter the 'just-get-er-done' crowd one bit. In some instances the parallel programming requirement becomes a simple budget justification for buying a license or two of some multicore software tool X from Vendor Y that is guaranteed to maximize the optimal concurrent capacity of the N-core servers and 'speed the development time of the entire team'. I'm reminded of the words of Sir Walter Scott (or was it Capt. Montgomery Scott), 'Oh what a tangled web we weave when first we practice to deceive'. All we can say to our software development brethren is to just say no when asked to produce software in a process that shortcuts, leaves out, demeans, belittles, or otherwise stifles the proper design efforts necessary to produce high quality, accurate, correct, reliable and safe software. If multicore will make a good design run faster, it will make a poor or bad design crash faster still. Don't get us wrong. We do believe in using the right tool for the job. But we've learned (through experience) that the tool should not be used as a replacement or shortcut for the due diligence of design and software engineering. Quad-core is quickly becoming the norm. Soon it will be 8 cores, and 16 cores before you know it. There is already a mismatch between the software design and the new multicore paradigm. As the number of cores increases, the mismatch becomes more apparent. There is the inevitable pressure to 'just-get-er-done' and bring the software design magically in line with the new reality of multicore processors. The most seductive way to do this is just to grab one of the new tool sets out there and just have at it. While this approach will actually get some traction in a couple of areas, in general those who fall into this very seductive trap are missing the bigger picture.
We're on the road to massive parallelism at the hardware level but most of our commonly used design metaphors and development vocabulary lexicons are tied to decades of sequential programming techniques and at best generation one parallel programming techniques. In general, we lack the design vocabulary to communicate in the context of massively parallel execution contexts.

Once it is determined that legacy or new software development efforts demand parallel programming or concurrency, a new vocabulary, a new set of design concepts, a new set of testing concepts, in short a whole new development culture that is 'concurrency-aware', has to be adopted and shared by the development team or group, if it is not already present. Concurrency, parallelism, threading, etc., will impact design documentation, implementation documentation, group communication, testing techniques, debugging techniques, legacy library usage, 3rd party component integration, and so on. And once the team or group adopts this culture it's a long term proposition. Inertia will set in and then no one will be interested in switching to some new concurrency paradigm. This is why teams and groups that have not yet taken the parallel programming plunge and are in the process of evaluating their needs and performing the requisite system analysis should proceed with caution. Many tool sets are accompanied by a canned philosophy that may or may not be what the doctor ordered. SFAQL (Shoot First Ask Questions Later) Parallelism can introduce many dead-ends and false starts to a team just entering the parallel programming fray. Multicore and soon-to-be massively multicore computers are here to stay. Design philosophies and tool sets should be thoughtfully and carefully chosen. The software development, software engineering, and computer science communities are undergoing a number of paradigm shifts at the moment. A forward-looking software development group will really take pause and consider the big picture and then put together a short-range, mid-range, and long-range plan that integrates the new computer architectures.

We are convinced that the paradigm shift necessary to truly take advantage of massive parallelism will not come in the form of some nifty vendor tool that hides or encapsulates parallelism from the application layer or application developer. While compiler improvements that can transform some sequential code to parallel code will be and are welcomed, these improvements will be by definition negligible when compared to problem space and solution set space complexity. Of course, we're also grateful for the tools that are attempting to hide interprocess communication, control, mutexes, semaphores, and locking. And for those of us implementing various types of servers that can utilize those tools, hooray! But the real paradigm shift will have to come at a level higher than coding or coding tools, caveat emptor. If we zoom out and look at the software forest instead of the software tree, we will see that software is crossing a threshold of complexity and complication that is beyond our commonly used software maintenance and development paradigms. As this complexity clashes with the trend in multicore processors, it becomes painfully obvious that a significant paradigm shift is in order with respect to software design and maintenance metaphors. Tracey and I are putting all of our eggs into the basket of major breakthroughs at the software/system design and problem-solution modeling levels. The current prevailing design models are simply not sufficient to scale, even with the nifty vendor solutions that make concurrency transparent to the application. To give you a clearer understanding of where we are coming from (and consequently where we are going), consider our simple model of a successful system/software design shown in Figure 1.

A successful system/software design will be an intersection of the problem space, the solution set space, and a knowledge space. For convenience, we will call this intersection S. While this might be obvious to some, it may not be to others. S is where the action is, or shall we say S is where the real innovation and paradigmatic breakthroughs will necessarily come from. To put it simply, the problem space represents the complete initial state of affairs in the problem domain. The problem space could be understood as a model that captures the world that describes the problem domain scenario. The solution set space is the set of good and (for our purposes) bad solutions to the problem captured in the problem space. The knowledge space represents the knowledge that is available to bring to bear on the problem space and on navigating the solution set space. Please keep in mind that we are presenting an oversimplified notion of problem space, solution set space and knowledge space. I can just see the e-mails now chastising us for what we've left out.

To give an example of what we mean by problem space, solution set space, and knowledge space, we'll resurrect the 19 e-mails problem that we had. In this problem the problem space consisted of the fact that we had 19 e-mails received over several years and that for reasons unavailable we had no time or date stamp information for the e-mails. In fact we had no type of information that could put the 19 e-mails in the proper chronological order. Further, the problem space would include the fact that we needed to reconstruct the chronological order that the e-mails were sent in, because the e-mails contain sensitive information that could only be understood if their proper chronological order was established. The problem space would also include the fact that we were pressed for time in identifying the correct sequence of the e-mails. All of this together describes a simple model of the problem space.

For this e-mail problem we simplified the solution set space and described it as 19! or factorial(19). This amounts to 121,645,100,408,832,000 possible arrangements of the 19 e-mails. In the worst case scenario, all 121,645,100,408,832,000 might have to be considered in order to identify the correct chronological order of the 19 e-mails.
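The size of that solution set space is easy to check. The sketch below uses a hypothetical `factorial` helper with 64-bit unsigned arithmetic, which is wide enough for 19! (the largest factorial that fits is 20!).

```cpp
#include <cstdint>

// n! for small n. Valid for n <= 20; 21! no longer fits in 64 unsigned bits
// and the result would silently wrap around.
std::uint64_t factorial(unsigned n) {
    std::uint64_t result = 1;
    for (unsigned k = 2; k <= n; ++k) result *= k;
    return result;
}
```

Calling `factorial(19)` returns 121645100408832000, which matches the figure quoted above.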

The knowledge space includes deep natural language processing capabilities of a software agent, as well as the ability to generate permutations. The knowledge space in this case also consists of robust domain specific ontologies (more on that later). More importantly the knowledge space includes deep knowledge of the notions of parallel search and cooperative concurrent problem solving. Somewhere in the intersection of these three spaces (problem, solution set, knowledge) is a successful software/system design that we conveniently call S.
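To make "generate permutations" and "parallel search" a little more concrete, here is one possible way to carve up the orderings, offered only as an illustration and not as the authors' design. It assumes each agent takes the slice of orderings that begin with one particular e-mail, so that agents share no state. The function names are hypothetical, and the slices are visited one after another here rather than on separate threads.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Counts the orderings of items 0..n-1 that start with `first`.
// Each value of `first` defines an independent slice of the search space.
std::uint64_t count_orderings_starting_with(unsigned n, unsigned first) {
    std::vector<unsigned> rest;
    for (unsigned i = 0; i < n; ++i)
        if (i != first) rest.push_back(i);  // built in ascending order
    std::uint64_t visited = 0;
    do {
        ++visited;  // a real agent would score the ordering (first, rest...) here
    } while (std::next_permutation(rest.begin(), rest.end()));
    return visited;
}

// Visits every slice in turn and adds up the number of orderings seen.
std::uint64_t count_all_orderings(unsigned n) {
    std::uint64_t total = 0;
    for (unsigned first = 0; first < n; ++first)
        total += count_orderings_starting_with(n, first);
    return total;
}
```

For 5 items each slice holds 4! = 24 orderings and the slices add up to 5! = 120. Trying this with n = 19 is not advisable: a single slice already holds 18! orderings.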

Concurrency and multicore architectural issues are actually resolved during the detailed expansion and examination of S. It is important to note that the concurrency design is thoroughly dealt with long before implementation tools or products are even considered. Long before we talk about debugging tools, watching multiple threads execute, and monitoring process pools using tool X, parallelism concerns must be dealt with at the design level. Make no mistake about it: if concurrency is not effectively dealt with at the design level (the S level), then whatever software is produced will be unreliable, impossible to maintain, and impossible to extend. Now the question is what metaphors, paradigms, solution models, and design tools we have that deal with concurrency and parallelism at the design level. Are these design tools, metaphors, and paradigms sufficient for the impending clash between the oncoming software complexity and the trend in multicore execution environments?

Let's spin it around and look at it from another vantage point. If we consider any significant problem that we are trying to solve using software and software agents, then we can look at a line with two end points. At one end is the unsolved problem and at the other end is the solved problem. Between the unsolved problem and the solved problem is something that we'll simply call 'work'. To drill down on this we'll go back to our 19 e-mails problem. At one end point we have 19 e-mails in some potentially random order. At the other end point we have the 19 e-mails in the proper chronological order. At issue is how much work it will require to put the e-mails in chronological order and what kind of work it will require. How long will the work take? Do we even know how to do the work in the first place? What is an efficient method of doing the work? What is an inefficient method of doing the work? How much effort is needed to do the work? Figure 2 gives us a picture of this and how it relates to our three spaces.

Now in our 19 e-mails problem the nature of the work may include search, inference, natural language processing, probabilistic reasoning, and symbolic processing. Now we have classified the activity between the two end points in Figure 2 as work. How we characterize this work in the context of getting to the solution has everything to do with whether concurrency is necessary and if it is how it will be used. Beyond this, what problem solving tools do we have at the design level? Keep in mind we are only talking about the design level here.

Our use of parallelism and concurrency is constrained by the tools, paradigms, and vocabulary we have at the design level. At the end of the day, if we cannot convincingly and effectively describe the concurrency required to reach our solution at the design level, then the software engineering effort is in jeopardy. Concurrency may introduce itself in the requirements in many ways. It may be an explicit statement in the requirements that some set of components must operate simultaneously and must communicate synchronously or asynchronously. That is, concurrency is a natural attribute and requirement of the problem space and solution set space. Concurrency may introduce itself as a result of the nature of the solution model. For example, I may use the blackboard model of problem solving to deal with the deep natural language processing requirements that are present in my 19 e-mails problem. The blackboard model just happens to be effective at solving certain kinds of natural language processing problems. The fact that it includes parallelism or promotes concurrency is secondary. Concurrency may introduce itself because the nature of the problem is sufficiently complex or complicated that concurrent divide-and-conquer solutions are the most natural fit. Concurrency may also introduce itself into the design because time constraint requirements and the realities of the size of the search space suggest designs that attempt to find solutions on several fronts simultaneously.

What we are suggesting here is that the problem space will either require or insinuate concurrency in the solution model. The solution set space may be so partitioned that concurrency is the natural way to navigate it. In other words, concurrency/parallelism will be an artifact of the problem space, the solution set space, or both. So we manage and model concurrency at the design level. The system/software requirements will typically have a performance constraint that must be met by the hardware in order for the software to be considered successful. For example, in our 19 e-mails problem it's absolutely crucial that we get back the correct chronological order of the e-mails in less than 30 seconds. So in the worst case scenario we have 100 quadrillion or so possible orderings of the e-mails, and we have less than 30 seconds to find the right one. If I know one agent can evaluate one ordering every 3000 seconds, then I know one agent working by himself would take 3000 times roughly 100 quadrillion seconds, or about 3 x 10^20 seconds, to evaluate every ordering. And we might have to evaluate every ordering because the last arrangement could be the correct one. Just for fun, let's round that up very generously and call it a septillion seconds. We know that the successful system has to find the correct chronological order in less than 30 seconds. So how do we take a septillion seconds of work and distribute it in such a way that it can be done by some number of agents operating in parallel in less than 30 seconds? What if each agent gets a dedicated core to work with? How many cores would it take to fit a septillion seconds of work into under 30 seconds? How many mutexes, semaphores, pipes, and queues are we talking about if the agents need to communicate and update some shared piece of information simultaneously?
Should we just go out and buy the biggest box(es) available and the most recently announced parallelization tool(s) that will make parallelism transparent to the application and 'just-get-er-done'? Keep in mind we're only talking about arranging 19 e-mails here and I'm sorry to say that SFAQL Parallelism won't work here.
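The "how many cores" question can be answered on the back of an envelope. The sketch below assumes the work divides perfectly among agents, one agent per core, with no communication or synchronization cost at all, which is far more generous than any real system. The function name `cores_needed` is hypothetical.

```cpp
#include <cmath>

// Minimum number of cores needed to evaluate `orderings` candidates within
// `deadline_seconds`, given the time one agent needs per candidate.
// Assumes perfect division of work and zero coordination overhead.
double cores_needed(double orderings, double seconds_per_ordering,
                    double deadline_seconds) {
    return std::ceil(orderings * seconds_per_ordering / deadline_seconds);
}
```

With the figures from the text, 19! orderings at 3000 seconds each and a 30 second deadline, `cores_needed(121645100408832000.0, 3000.0, 30.0)` comes out at roughly 1.2 x 10^19 cores. Even under ideal assumptions, brute force is out of reach.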

We will have a great deal to say about this problem because it's extremely representative of many problems we encounter in software engineering and computer science every day. In fact, because of the complexity inherent in the Internet and the many forming Clouds on the horizon, we will have software agents that will be faced with problems many orders of magnitude bigger than the problem of fitting a septillion seconds of work into 30 seconds. So how we model that work at the design level will be critical to our ability to successfully create a manageable, reliable, and safe system. Obviously the paradigm that led us to a septillion seconds of work being forced into under 30 seconds is the wrong paradigm. If you do the math you'll find out that Holger Hoos from the University of British Columbia was onto something in his "Taming the Complexity Monster" talk. Tracey and I continue to harken back to the ghosts of ICOT and the Fifth Generation project for a reason. We suspect that, while facing the precipice of one set of problems, the ICOT researchers may have unknowingly uncovered an aperture that could shed light on our current dilemma of operating systems, parallel processing, and programming paradigms.

Today it is applications with hundreds to thousands of threads. What will we do when it's a hundred thousand or more threads? What happens to synchronization at that level? How will we even make simple programming changes when hundreds of mutexes, locks, and synchronization mechanisms are all simultaneously in play? What will the documentation look like? What will change management look like? What super slick doodads will the vendors be selling then? One thing is certain: SFAQL Parallelism won't work, because it produces unpredictable, unreliable, unmaintainable, brittle systems that cannot scale and that are impossible to understand, even by the original team, once a little time has passed. Going forward we need to place far more emphasis on devising design tools that will help us design correct, reliable, and safe systems. Tools that truly integrate with the culture of the development team or group. Design methodologies that will be around for the long haul.

Metaphysical Logical Positivist Post-Modernistic Parallel Philosophy Thought For Today

Perhaps large monolithic imperative-procedural-based system designs are reaching their practical limits. Maybe instead of larger more complex, we should be thinking smaller simpler. Maybe instead of thinking imperative-procedural, we should be thinking declarative-inferential. Maybe instead of thinking parallel and concurrent, we should be thinking in terms of induction, recurrence, and recursion. Hive and colony instead of cluster and network.

QuickThread: A New C++ Multicore Library (2009-11-16)

ghillar (gastonhillar@hotmail.com), Parallel Tasking

NUMA (Non-Uniform Memory Access) architectures are becoming popular in HPC (High-Performance Computing) scenarios. Therefore, it is very important to work with efficient and optimized memory allocators. QuickThread is a new commercial C++ multicore programming library loaded with many optimizations for NUMA architectures, bringing a new option to create high-performance parallelized code.

A few months ago, Jim Dempsey, CEO and Chief Architect of QuickThread Programming, LLC, invited me to offer him some feedback about QuickThread's beta versions. I accepted his offer, as I'm always attracted to new development tools, languages, libraries and paradigms related to multicore programming. Thus, I've been able to have early access to many of the features found in its first official release.

One of the most interesting features found in QuickThread is its design to take full advantage of the underlying hardware, considering all the cache levels (L1, L2 and L3) and the NUMA nodes. The developer can experiment with different configurations in order to maximize the performance offered by certain algorithms running on very complex and heterogeneous hardware.

QuickThread extends the concept of thread affinity to a new level. It offers developers more control than other libraries because it considers more low-level details of the underlying hardware. For example, it allows a developer to allocate data objects from a particular NUMA node.

Besides, QuickThread focuses on offering developers a simple way to refactor their existing serial code without needing to create additional classes. Developers can work on the same code base and add parallelism by replacing existing code snippets. However, as always, it is necessary to consider all the new complexities introduced by concurrent code. QuickThread makes it simpler to replace an existing loop with a parallelized loop. Nonetheless, the developer has to create code capable of running concurrently without generating undesired side effects.
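To illustrate what replacing a serial loop with a parallelized one involves, here is a sketch written with standard C++11 threads. It is explicitly not QuickThread's API: `chunked_parallel_for`, `scale_serial` and `scale_parallel` are hypothetical names, and the helper is far simpler than a real tasking library with a thread pool.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper, NOT QuickThread's API: runs body(i) for every i in
// [begin, end), split into contiguous chunks with one thread per chunk.
void chunked_parallel_for(std::size_t begin, std::size_t end, unsigned threads,
                          const std::function<void(std::size_t)>& body) {
    if (threads == 0) threads = 1;
    const std::size_t count = end > begin ? end - begin : 0;
    const std::size_t chunk = (count + threads - 1) / threads;  // ceiling division
    std::vector<std::thread> team;
    for (std::size_t lo = begin; lo < end; lo += chunk) {
        const std::size_t hi = (end - lo < chunk) ? end : lo + chunk;
        team.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& t : team) t.join();
}

// Serial original: each iteration writes only to its own element.
void scale_serial(std::vector<double>& v, double factor) {
    for (std::size_t i = 0; i < v.size(); ++i) v[i] *= factor;
}

// Parallel replacement: safe because no two iterations touch the same element.
void scale_parallel(std::vector<double>& v, double factor, unsigned threads) {
    chunked_parallel_for(0, v.size(), threads,
                         [&v, factor](std::size_t i) { v[i] *= factor; });
}
```

The loop body is unchanged between the two versions, which is the kind of refactoring the paragraph above describes. It only works here because iterations are independent; a body that updated a shared counter or accumulator would need a reduction or synchronization to avoid the side effects mentioned above.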

Most of the optimizations found in QuickThread work best with Intel C++ Compiler Professional Edition for Windows. As it is a C++ library, it can also take advantage of many of the tools offered by Intel Parallel Studio to optimize parallelized code. It competes with Intel Threading Building Blocks, also known as TBB, because it offers an alternative way to parallelize existing C++ code.

QuickThread allows developers to create both 32-bit (x86-32) and 64-bit (x86-64 or EM64T) applications, and it supports the following affinity schemes:

• Thread affinity.

• Data binding affinity.

• NUMA-aware affinity.


It provides a tasking system using thread pools with the goal of producing a minimal overhead mechanism for distributing work in multicore and manycore systems. The reduced overhead and the efficient allocator are two of the most impressive features found in this library and can make a big difference compared to other less efficient libraries. In fact, QuickThread can optimize the scheduler at run-time when many different cache organizations and levels appear.

The reduced overhead and the efficient scheduler are even more important when the parallelized code runs in multi-socket configurations with multiple cores in each socket. I do believe QuickThread has many interesting features for these environments.

QuickThread offers the following parallel constructs:


parallel_distribute: Schedules a task team to work on different portions of the same task.

parallel_for: Schedules a task team to run across an iteration space (divided up evenly among team members or chunked up among team members). It offers a classic for loop parallelization without the need to write an additional class.

parallel_for_each: Schedules a task team across an iteration space divided upon demand by each team member number.

parallel_invoke: Invokes multiple different tasks provided by C++0x Lambda functions.

parallel_list: Schedules a task team to process a single linked list of objects.

parallel_pipeline: Schedules a task team to process a sequence of steps contained within a vector.

parallel_reduce: Schedules a task team across an iteration space divided upon demand by each team member number whilst performing a reduction operation.

parallel_task: Schedules a single task for its execution.


In its first official version, QuickThread offers support for Lambda functions (C++0x). It also offers support for FORTRAN, although that support is not fully implemented in the first official release. One of its main drawbacks is that the document comparing it with Intel Threading Building Blocks is written for developers with previous TBB experience, without an introduction for beginners. Therefore, if you haven't worked with TBB before, you will likely find it a bit difficult to understand. If this is your case, you can read the QuickThread programmer's reference instead.

The library offers outstanding performance. Therefore, it is a very interesting option when you're looking for high performance in C++ with simple code.

If you're interested in the features offered by QuickThread, you can read its "Programmers Reference Guide" and "A Comparative Analysis with Intel Threading Building Blocks"

Speeding Up Code Without Doing Anything
2009-11-15T18:26:31Z 2009-11-15T16:25:25Z tag:,2009:/108.52746 2009-11-15T16:25:25Z
sblair stephen.blair-chappell@intel.com Parallel Worlds

Of all the techniques I use to speed up code, the one I like the most comes with just the press of a button, or more precisely at the swap of a compiler. Every Intel compiler has this particular option, and I consider it to be a great friend. I'm making a point of keeping you in suspense for a little while longer. Let me first tell you a couple of stories that prove the point.

I was involved some time back in supporting a company that was upgrading from version 9.0 to 10.1 of the Intel compiler (we are now on version 11.1). Actually there was no work involved, but I was 'on call' just in case they had any problems. Within a day of upgrading, the project manager wrote to me saying "we must have version 10.1 of the compiler. Our application speed has just doubled in performance." The application was to be their main 'bread winner' for the following two years.

I had a pretty good idea what had happened. In version 9.1 of the compiler, my favourite option had to be turned on explicitly by the developer. Like many users, they hadn't got around to reading the user manual or experimenting with some of the compiler switches. In version 10.1, Intel changed the default behaviour of the compiler, so my favourite option was already enabled -- hence the speedup in the customer's code.

I experienced an even more significant speedup with another customer, resulting in their application running 10 times faster. The application was a car engine simulator which was used in the testing of electronic car management systems.

The engine simulator was first designed in MATLAB, and the resulting C code was then compiled with the Microsoft compiler. I installed the Intel C++ Compiler, which is a plug-and-play replacement for the Microsoft C++ Compiler, and the simulation sped up by a factor of 10.

I suppose I have to spill the beans. The option is auto-vectorisation. All recent versions of the Intel compiler have this option available. If you are already using the latest Intel compilers, such as the one that comes with Intel Parallel Studio then the auto-vectoriser is already turned on (unless you have explicitly turned it off).

In the case of the engine simulator, by upgrading the ecosystem to the latest multicore hardware, we achieved:

  • An initial speedup of 20% by using the Intel® C++ Compiler on the original hardware
  • A final speedup of over 76 (i.e. 7600%), which consisted of
    • 10 times speedup due to enabling auto-vectorization
    • 7 times speedup due to hardware upgrade

A fuller description of the Engine Simulation project can be found here

A free evaluation of Intel Parallel Studio can be downloaded from here.

Multicore Scalability is Already Possible With the Intel Atom Family
2009-11-12T06:36:40Z 2009-11-12T06:27:49Z tag:,2009:/108.52686 2009-11-12T06:27:49Z
ghillar gastonhillar@hotmail.com

Whilst designing and developing applications targeting netbooks and MIDs (Mobile Internet Devices), one of the great questions is whether there is a real possibility of scaling to more cores in the near future. There's no need to think about the future. The Intel Atom family is already offering a dual-core microprocessor with Hyper-Threading, two physical cores and four hardware threads.

I've already talked about downsizing multicore programming skills to take advantage of Intel Atom. After that post, I received many questions about the potential scalability offered by the Intel Atom family. I'm answering all those questions with this post. It is very important to create applications capable of scaling when more cores are available because there's already a dual-core Atom chip on the market. Furthermore, more cores could appear in the near future as the manufacturing processes improve.

The message is very clear. If you're targeting netbooks and MIDs, you cannot escape from the multicore revolution. A few manufacturers offer netbooks with two single-core Intel Atom microprocessors. Others already use the Intel Atom Processor 330, which includes two physical cores with Hyper-Threading. Therefore, you'll be able to scale your multicore optimized code in mobile devices based on the Intel Atom Family.

If you have an application that takes advantage of multicore architectures and you migrate from a single-core Atom Processor N270 with Hyper-Threading to an Intel Atom Processor 330, you'll move from 2 to 4 logical cores. If the application's parallelized algorithms scale as the number of logical cores increases, you'll be able to translate the new multicore power found in the new microprocessor into additional application performance.
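
How much of the jump from 2 to 4 logical cores turns into performance depends on how much of the work is parallelized. As a rough sketch, Amdahl's law gives an upper bound. The parallel fraction of 0.9 used below is an assumed value for illustration, not a measurement, and the estimate ignores the fact that Hyper-Threading logical cores are not equivalent to physical cores.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Upper bound on speedup over one core according to Amdahl's law."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed (hypothetical) parallel fraction of the application.
p = 0.9
gain = amdahl_speedup(p, 4) / amdahl_speedup(p, 2)
print(f"Estimated gain moving from 2 to 4 logical cores: {gain:.2f}x")
```

Under these assumptions, doubling the logical cores yields clearly less than a 2x gain, which is why it pays to measure how well the parallelized algorithms really scale.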

Besides, the Intel Atom Processor 330 is not limited to 32 bits. It offers the 64-bit instruction set, known as x86-64 or EM64T. If you take advantage of the power offered by these microprocessors, you can create exciting applications for netbooks and MIDs. You can also work with scalable algorithms, preparing applications for the forthcoming multicore chips.

Just to mention a real-life example, QuickThread, a new parallel programming toolkit designed for creating high-performance multicore and NUMA-aware applications, has also shown great performance improvements and interesting scalability when running algorithms on existing Intel Atom microprocessors with different numbers of cores. Atom wasn't the original target. However, after some tests, it became a very interesting market for this toolkit. I'll include more detail about QuickThread's capabilities in another post soon.

Two days ago, Intel released an Alpha SDK to jump-start developers writing netbook applications for the Intel Atom Developer Program. If you want to create apps for this new program, remember that you can take advantage of multicore and your applications will be able to scale in the near future, even on netbooks and MIDs.

Most operating systems running on netbooks and MIDs are already prepared to take advantage of more logical cores. I've already talked about Moblin v2.0 being multicore ready. You can find more information on Moblin at Dr. Dobb's Moblin Zone.

Go: A New Concurrent Systems Programming Language from Google
2009-11-11T13:42:23Z 2009-11-11T04:46:58Z tag:,2009:/108.52656 2009-11-11T04:46:58Z
ghillar gastonhillar@hotmail.com Parallel Tasking

Google launched Go, a new systems programming language born with concurrency, simplicity and performance in mind. Do you have time to learn another programming language this year?

It is really difficult being a developer. There are a lot of new programming languages trying to bring a silver bullet. They try to simplify concurrency whilst keeping the learning curve as smooth as possible by borrowing syntax from popular existing languages. However, is there a promising future for so many new languages?

Go is open source and its syntax is similar to C, C++ and Python. It is an expressive language with pointers but no pointer arithmetic. It is type safe and memory safe. One of its main goals is to offer the speed and safety of a static language together with the advantages offered by modern dynamic languages. Go also offers methods for any type, closures and run-time reflection. The syntax is pretty clean and it is garbage collected. It is intended to compete with C and C++ as a systems programming language.

What about multicore programming with Go? It promotes lightweight concurrency, allowing developers to create sets of lightweight communicating processes. Go calls them goroutines. This way, you can run many concurrent goroutines and you don't need to worry about stack overflows. Go promotes sharing memory by communicating. Goroutines aren't threads; they are functions running in parallel with other goroutines in the same address space. It is very easy to launch parallel functions using goroutines. This is one of the most interesting features offered by the language. It really simplifies concurrency for systems programming.
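
The "share by communicating" idea can be sketched without Go syntax. The following Python analogy uses threads and a `queue.Queue` in place of goroutines and channels. It is only an approximation of the style (Python threads are not goroutines), and the names `worker` and `sum_of_squares` are hypothetical.

```python
import queue
import threading


def worker(jobs, results):
    """Receive numbers from `jobs`, send their squares to `results`."""
    while True:
        n = jobs.get()
        if n is None:  # Sentinel: no more work for this worker.
            break
        results.put(n * n)


def sum_of_squares(numbers, workers=4):
    """Sum squares using workers that only communicate through queues."""
    numbers = list(numbers)
    jobs, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(jobs, results))
               for _ in range(workers)]
    for t in threads:
        t.start()
    for n in numbers:
        jobs.put(n)
    for _ in threads:
        jobs.put(None)  # One sentinel per worker.
    for t in threads:
        t.join()
    # All workers have finished, so every result is already queued.
    return sum(results.get() for _ in numbers)
```

No worker touches shared variables directly. Each one receives its input and hands back its output through a queue, which is the discipline that channels encourage.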

Go's key features related to concurrency are:

* Channels.

* Channels of channels.

* Goroutines.

* Leaky buffers.

* Share by communicating approach.

These features deserve new posts explaining them in more detail. Stay tuned because I'll be adding new posts about Go soon.

The idea behind Go is to offer a fast compiler to produce fast code. So far, it offers two compilers:

* Gccgo (with GCC as the back end).

* 8g (x86-32) and 6g (x86-64).

Today, Go went open source and it is a promising systems programming language. The support offered by modern IDEs will be crucial to its early adoption. Its concurrency features are really interesting for systems programming.

If you want to test this new programming language, your starting point is Go's main page.

Besides, you can watch the video of a Google Tech Talk for The Go Programming Language, by Rob Pike. Nonetheless, I must warn you that it's a long video.

There's Good News and There's Bad News ...
2009-11-10T22:32:46Z 2009-11-10T21:13:00Z tag:,2009:/108.52651 2009-11-10T21:13:00Z
chughes ctestlabs@ctestlabs.org Multicore Moments

Last week we gave the webinar scene a break. Instead, Tracey talked me into going with her to the new Bill Gates Hillman Complex located on Carnegie Mellon campus in Pittsburgh. Holger Hoos from the University of British Columbia was giving a talk called "Taming the Complexity Monster". I figured how could we go wrong!

Since we both are currently wandering in the wilderness of AI-Complete problems and our multicore machinations are turning out to give us deceptively correct answers, I thought just maybe Holger Hoos might have some encouraging words for us. After all, Dr. Hoos is a founding member of the Bioinformatics, Theoretical and Empirical Algorithms Laboratory and a member of the Laboratory for Computational Intelligence with additional involvement with the Institute for Computational Intelligence and Cognitive Systems. Surely he could shed some light on our massive multicore vs. AI-Complete problem issue.

I remember thinking on the drive down that it felt like we were off to see the wizard. So after taking a wrong turn somewhere around Veteran's Bridge in downtown Pittsburgh and getting lost, we finally made it to the Gates Hillman Complex.

The Bill Gates Hillman Complex located on Carnegie Mellon campus

There were several eye-popping and jaw-dropping moments in the talk. Tracey and I were especially encouraged when he started talking about problems that were similar to the ones we are currently challenged with.

We weren't the only ones that were captivated by the talk. You could almost feel the excitement in the air. But with one click of the mouse everything went downhill for our multicore expectations. Out came the PowerPoint. Out came the performance graphs. Out came the truth (at least the truth according to Holger). He demonstrated that if we had one processor for every atom in the universe, and if each processor could execute instructions at the speed of light, a parallel computational solution for the class of problems that we are trying to solve would take millions, billions, trillions of years to complete! Ouch! That was very bad news to hear.

He also suggested that, for the class of problems that Tracey and I are stuck with, even such large-scale massive parallelism would offer only a negligible speedup in relation to problem size. How could that be? I immediately pulled out my PSP (really I did), logged in to our private lab and research cluster, typed some numbers into MuPAD, and slowly the bad news was retraced on the screen of my PSP.

And just as we were about to succumb to Holger's talk, he introduced a new approach to algorithm design. He had some new approaches to some of the classic NP-hard problems. These new approaches had a dramatic effect on algorithm design that brought computational solutions within practical reach. He introduced the Concorde TSP Solver, the DPLL (Davis-Putnam-Logemann-Loveland) algorithm, and the new field of empirical algorithms approaches. As I looked closer I saw what looks, at least to me, like a striking intersection between some of the empirical algorithms and some of the findings of ICOT and the Fifth Generation project. So it would seem that there is hope after all, that the ghosts of ICOT are trying to tell us something, and that some of the stones we've turned over could lead to significant paradigm shifts for parallel programming and multicore computers. So our trip turned out to be good-news, bad-news.

The good news was that we heard an approach to solving our particular brand of problems. The bad news was that we heard an approach to solving our particular brand of problems but there was no short cut in sight, no easy way out, no silver bullet, no rest for the wicked.

Visual Studio 2010 Multi-Monitor Support Helps Debugging Parallel Code
2009-11-10T13:29:28Z 2009-11-10T03:14:28Z tag:,2009:/108.52629 2009-11-10T03:14:28Z
ghillar gastonhillar@hotmail.com

Debugging a parallelized application requires more information on the screen than debugging sequential code. Sometimes, even Full HD monitors aren't enough to display all the necessary windows at the same time. Luckily, Visual Studio 2010 Beta 2 offers a very intuitive multi-monitor support.

You already know that Visual Studio 2010 Beta 2 offers many new parallel debugging windows. I've talked about them in my previous post "New Parallel Debugging Windows in Visual Studio 2010 Beta 1". Furthermore, it also introduces new analysis and profiling tools, like the one explained in my post "Visualizing Parallelism and Concurrency in Visual Studio 2010 Beta 2". These new tools bring new windows and palettes to your Visual Studio IDE.

If you work with parallel code for many hours using one monitor, you'll find yourself moving many windows to organize the information on your screen whilst debugging. You'll want to check the information provided by the Parallel Stacks window. However, at the same time, you'll want to see the tasks and their related threads on other windows. I'm talking about debugging parallel code. Therefore, you'll also want to see the code that's being executed. Did I mention locals, call stack, watches, among other windows?

I've been working with dual-monitor workstations for many years now. However, Visual Studio 2010 Beta 2 simplifies the usage of multiple screens to distribute its windows and palettes. This version detects the presence of a dual-monitor configuration and it allows you to simply drag and drop the different windows and palettes to the desired screen. This way, you can drag and drop any window or palette integrated in the development environment. What happens if you disconnect the second monitor? All the windows and palettes that were displayed on that monitor will automatically move to the visible screen area. Simple and intuitive. Great for parallel developers that need dozens of windows at the same time.

It isn't limited to the default windows and palettes provided by the development environment. You will also be able to organize the windows created by extensions like Intel Parallel Studio.

You usually have to switch to multi-monitor mode in many notebooks and laptops using a shortcut. Most modern desktop computers and workstations also include GPUs with support for at least two monitors. Once you activate the dual-monitor mode, Visual Studio 2010 will detect the presence of the new screen. The following picture shows a notebook's dual-monitor activation icon:

Figure 1: Activating a dual-monitor configuration.

The following picture shows a typical parallel debugging scenario, with many windows on a single screen. As you can see, it is necessary to move and resize the windows to be able to see all the necessary information.

Figure 2: Too many windows on a single screen.

The following two pictures show the same windows distributed in two independent screens:

Figure 3: Parallel Stacks, Parallel Tasks, Threads, and Autos windows organized to see all the necessary information on the left screen.

Figure 4: The code, the menu, and the Immediate window organized to see all the necessary information on the right screen.

The following picture shows a notebook and an LCD monitor displaying the two previously explained windows in two screens, taking advantage of the multi-monitor support offered by Visual Studio 2010 Beta 2.

Figure 5: A dual-monitor configuration in action.

It is easier to debug parallel code when you can see all the necessary information without having to move and resize your windows every 5 seconds.

Disclaimer: This post isn't sponsored by a monitor manufacturer. I've found it really useful to work with Visual Studio 2010 Beta 2's new multi-monitor support whilst developing and debugging parallelized applications.

Sequential Programming: Like Eating Peas with a Straw.
2009-11-03T18:40:41Z 2009-10-30T14:29:01Z tag:,2009:/108.52450 2009-10-30T14:29:01Z
sblair stephen.blair-chappell@intel.com Parallel Worlds

Before the era of multicore chips, performance gains in CPUs were achieved by a combination of ever increasing speed and architectural enhancements. This resulted in more and more power being consumed by the processor -- a situation that could not continue forever.

Something's cooking

I remember a tongue-in-cheek competition 'alternative uses of the Pentium' that came up with some entertaining suggestions. The winner suggested wiring four Pentiums together and using them as a cooker hob. Very amusing! 

The multicore race is here

Rather than making processors faster and faster, the received practice now is to get extra performance from multiple cores. Recently I read a news item that said a start-up company was proposing to make the first 100-core CPU -- claiming that they would 'pip Intel to the post.' So rather than the MHz race we had in the '80s and '90s, we are now entering the multicore race. It seems that the multicore era is here to stay -- so we'd better get used to the idea.

Back in 2007, Intel announced its 80-core research chip to the public. It made some programmers wonder how on earth a program could be written to take advantage of so many cores. Writing for 2 or 4 cores seemed manageable, but 80 cores seemed unimaginable.

Figure 1: Intel's 80-core research chip (circa 2007).

Speed (GHz)    Power (Watts)    Perf. (Teraflops)
3.16           62               1.01
5.1            175              1.63
5.7            265              1.81

Figure 2: Performance of the 80 core chip.
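
The figures listed for Figure 2 make the power problem easy to quantify. This short Python sketch divides each performance figure by its power draw, using only the three measurements given; the variable and function names are mine.

```python
# (speed in GHz, power in watts, performance in teraflops), from Figure 2.
measurements = [
    (3.16, 62, 1.01),
    (5.1, 175, 1.63),
    (5.7, 265, 1.81),
]


def gigaflops_per_watt(power_watts, teraflops):
    """Convert teraflops to gigaflops and divide by the power draw."""
    return teraflops * 1000.0 / power_watts


for ghz, watts, tflops in measurements:
    efficiency = gigaflops_per_watt(watts, tflops)
    print(f"{ghz} GHz: {efficiency:.1f} gigaflops per watt")
```

Efficiency drops from roughly 16 gigaflops per watt at 3.16 GHz to under 7 at 5.7 GHz, which illustrates why pushing clock speed alone was not sustainable.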

The performance of this 80-core chip is over 1 Teraflop. Interestingly, the first teraflop computer, ASCI Red, was considerably bigger and was decommissioned in 2006.

Figure 3: ASCI Red, the first Teraflop computer. Try fitting this in your garage!

Software tools are the real challenge

The challenge for the programmer is how to write programs to take advantage of so many cores. Thankfully, companies such as Intel (did I say I work for them?) are putting huge efforts and resources into enabling parallel programming. The Intel Parallel Studio is an example of a tool suite that can be used to write parallel applications. It's good that the semiconductor industry is taking a lead in developing software tools as well as silicon; otherwise programming for these newer devices would be something akin to eating peas with a straw - entirely possible, but not very efficient.

 

Biomolecular device using self-assembled DNA nanostructures?
2009-10-30T05:54:45Z 2009-10-30T05:18:04Z tag:,2009:/108.52441 2009-10-30T05:18:04Z
chughes ctestlabs@ctestlabs.org Multicore Moments

As I sit at my computer with its multiple cores, considering the advantages of parallelism, faster computers, better performance, a strange feeling comes over me, 'Haven't I heard this before?'

We are constantly inundated with the idea of 'better performance' or 'the most current'. Whether it be a computer, a compiler, an operating system, a library, a development tool, or whatever, there is always a promise of something better, something current, something that will speed up the development process.

As a professional you must increase your skill level to accommodate it: 'What is the learning curve for that new compiler, that new API, development environment, operating system? And of course all of those cores are just sitting there with nothing to process; I must find a way to utilize them.' It is demanding being a professional; you can't lag behind this stuff. So we search for help, tools, workshops and conferences to bring us up to speed in a hurry. You know, 'got to meet those deadlines, and it would be great if I had some new mojo to throw in that system so I can get better performance'.


A few decades ago a similar problem was described:

“Since its early days, most research in computer science was concerned in one way or another with two problems:

1. Computers are too slow
2. Programmers are too slow”

That was stated 22 years ago by Ehud Shapiro in his book Concurrent Prolog Volume 1. This is a book of collected papers on, yep you guessed it, research related to ICOT. Will we just continue to develop new computers and then beat ourselves up trying to use them? And I mean just use them, let alone use them efficiently. How can we keep this up? How practical is it to get on top of these new technologies with the approach of attending a workshop, reading a book, reading a blog? What about truly radical stuff? What about new paradigms of computing? What happens when they become mainstream?

Well, what's coming down the pike? Atomic, molecular and quantum computing. Some are 'non-silicon-based'. Some extend the von Neumann architecture, others are radically non-von. Now what? They will require new programming models, new algorithms and new languages. They are not intended to replace silicon-based computers. They may be better, but only when used for certain domains and applications. Considering the state of contract programming, one way to survive may be to become an 'expert' in programming with one of the upcoming paradigms. But how in the world do you become an expert in programming a 'biomolecular device using self-assembled DNA nanostructures'? I am waiting for that workshop!

Coreinfo v2.0: A Simple Utility to Understand the Manycore Complexity in Windows
2009-10-27T23:12:15Z 2009-10-27T22:51:52Z tag:,2009:/108.52387 2009-10-27T22:51:52Z
ghillar gastonhillar@hotmail.com

Windows Server 2008 R2 and Windows 7 (64-bit version) offer new NUMA (Non-Uniform Memory Access) support. Therefore, it is very important for Windows developers to understand the differences found in the complex underlying multicore and manycore hardware. Coreinfo is a very simple yet powerful command-line utility that shows you very useful information about the processors, their organization and the cache topology.

A few days ago, Mark Russinovich, a well-known member of the Windows Sysinternals team, made the new version v2.0 of Coreinfo available for download.

This command-line utility runs on most modern Windows versions and displays information about the mapping between logical cores (logical processors or hardware threads) and the physical cores. Besides, it shows information about the NUMA nodes, groups, sockets and all the cache levels. This information is very important to understand the underlying hardware. When you benchmark multicore performance, the great differences between many multicore architectures can make it really difficult to tune the application for a specific architecture. Using this command-line utility, you can easily save the information about the underlying hardware before running your benchmarks and performance tests.

The new version supports Windows Server 2008 R2 systems with more than 64 logical processors (logical cores or hardware threads). Besides, it is also compatible with IA-64 architectures.
You don't need to run an installer. You can unzip the executable file and run it from the command-line.

The utility uses the GetLogicalProcessorInformation Windows API function to obtain all the information displayed on the screen. Therefore, you can also obtain this information in your applications to tune performance according to the underlying hardware architecture. In fact, if you plan to create applications targeting manycore systems with multiple NUMA nodes, you'll have to take into account the detailed cache topology if you want to exploit the underlying hardware.

The results of running Coreinfo v2.0 on an Intel Atom N270 powered netbook are the following:
Logical to Physical Processor Map:
** Physical Processor 0 (Hyperthreaded)
Logical Processor to Socket Map:
** Socket 0
Logical Processor to NUMA Node Map:
** NUMA Node 0
Logical Processor to Cache Map:
** Data Cache 0, Level 1, 24 KB, Assoc 6, LineSize 64
** Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
** Unified Cache 0, Level 2, 512 KB, Assoc 8, LineSize 64

There is just one physical core. However, as this CPU offers Hyper-Threading technology, Coreinfo tells you it is Hyperthreaded.

The results of running Coreinfo v2.0 on an Intel Core 2 Duo P8600 powered notebook are the following:
Logical to Physical Processor Map:
*- Physical Processor 0
-* Physical Processor 1
Logical Processor to Socket Map:
** Socket 0
Logical Processor to NUMA Node Map:
** NUMA Node 0
Logical Processor to Cache Map:
*- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
*- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
-* Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
-* Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
** Unified Cache 0, Level 2, 3 MB, Assoc 12, LineSize 64

Coreinfo uses an asterisk "*" to represent a mapping. In this case, there are two physical cores and two logical cores as there isn't Hyper-Threading technology. Besides, there is a unified 3 MB Level 2 cache memory. Both physical cores share this cache, therefore, Coreinfo shows two asterisks "**" on the left side of the last line. This means that the cache is mapped to both processors:
*- = Physical Processor 0
-* = Physical Processor 1

Therefore, ** means Physical Processor 0 and Physical Processor 1.

The results of running Coreinfo v2.0 on an Intel Core 2 Quad Q6600 powered workstation are the following:
Logical to Physical Processor Map:
*--- Physical Processor 0
-*-- Physical Processor 1
--*- Physical Processor 2
---* Physical Processor 3
Logical Processor to Socket Map:
**** Socket 0
Logical Processor to NUMA Node Map:
**** NUMA Node 0
Logical Processor to Cache Map:
*--- Data Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
*--- Instruction Cache 0, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Data Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
-*-- Instruction Cache 1, Level 1, 32 KB, Assoc 8, LineSize 64
**-- Unified Cache 0, Level 2, 4 MB, Assoc 16, LineSize 64
--*- Data Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
--*- Instruction Cache 2, Level 1, 32 KB, Assoc 8, LineSize 64
---* Data Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
---* Instruction Cache 3, Level 1, 32 KB, Assoc 8, LineSize 64
--** Unified Cache 1, Level 2, 4 MB, Assoc 16, LineSize 64
Logical Processor to Group Map:
**** Group 0

In this case, there are four physical cores and four logical cores as there isn't Hyper-Threading technology. Besides, there are two unified 4 MB Level 2 cache memories. Each pair of physical cores shares one of these caches; therefore, Coreinfo shows asterisks to identify the processors mapped to each cache:
*--- = Physical Processor 0
-*-- = Physical Processor 1
--*- = Physical Processor 2
---* = Physical Processor 3

Therefore, **-- means Physical Processor 0 and Physical Processor 1, and --** means Physical Processor 2 and Physical Processor 3.

These are the two lines that display the information about each unified cache mapped to each pair of physical processors:
**-- Unified Cache 0, Level 2, 4 MB, Assoc 16, LineSize 64
--** Unified Cache 1, Level 2, 4 MB, Assoc 16, LineSize 64
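
The mapping notation is simple enough to decode programmatically. As a minimal sketch, this Python function turns one line of Coreinfo output into the list of logical processor indexes it marks with an asterisk. The function name `decode_mapping_line` is hypothetical, and the format is inferred only from the sample output shown in this post.

```python
def decode_mapping_line(line):
    """Split a Coreinfo mapping line into (processor indexes, description).

    The first whitespace-delimited token is the mask: position i holds
    '*' when logical processor i is mapped, '-' otherwise.
    """
    mask, _, description = line.strip().partition(" ")
    if not mask or set(mask) - {"*", "-"}:
        raise ValueError(f"not a mapping line: {line!r}")
    processors = [i for i, mark in enumerate(mask) if mark == "*"]
    return processors, description.strip()


processors, cache = decode_mapping_line(
    "--** Unified Cache 1, Level 2, 4 MB, Assoc 16, LineSize 64")
print(processors, cache)
```

Section headers such as "Logical Processor to Cache Map:" raise `ValueError` here, so a caller can skip them while walking through a saved report.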

In the aforementioned examples, there is just one NUMA node. Some of the results of running Coreinfo v2.0 on a server powered by two quad-core AMD Opteron 2379 HE microprocessors with a NUMA architecture are the following:
Logical to Physical Processor Map:
*------- Physical Processor 0
-*------ Physical Processor 1
--*----- Physical Processor 2
---*---- Physical Processor 3
----*--- Physical Processor 4
-----*-- Physical Processor 5
------*- Physical Processor 6
-------* Physical Processor 7
Logical Processor to Socket Map:
****---- Socket 0
----**** Socket 1
Logical Processor to NUMA Node Map:
****---- NUMA Node 0
----**** NUMA Node 1

In this case, Coreinfo shows very useful mapping information related to NUMA nodes.
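
The NUMA section of the output can be turned into a data structure that a benchmark script could use. The sketch below is Python and assumes the section layout shown above (mask lines followed by "NUMA Node" and a number); `parse_numa_nodes` is a hypothetical helper, not part of Coreinfo.

```python
import re

SAMPLE = """\
Logical Processor to Socket Map:
****---- Socket 0
----**** Socket 1
Logical Processor to NUMA Node Map:
****---- NUMA Node 0
----**** NUMA Node 1
"""


def parse_numa_nodes(text):
    """Return {node number: [logical processor indexes]} from Coreinfo text."""
    nodes = {}
    for line in text.splitlines():
        match = re.match(r"^([*-]+)\s+NUMA Node (\d+)\s*$", line.strip())
        if match:
            mask, node = match.group(1), int(match.group(2))
            nodes[node] = [i for i, mark in enumerate(mask) if mark == "*"]
    return nodes


print(parse_numa_nodes(SAMPLE))
```

For the dual-socket sample, the result maps node 0 to logical processors 0 through 3 and node 1 to logical processors 4 through 7.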

As it is a command-line utility, it is very simple to run it and redirect its output to a text file. For example:

coreinfo > cpudetails.txt

This command saves all the information to the cpudetails.txt file.

The application offers many parameters to select the information to dump:
-c Dump information on cores.
-g Dump information on groups.
-l Dump information on caches.
-n Dump information on NUMA nodes.
-s Dump information on sockets.

You can download Coreinfo v2.0 here.

Visualizing Parallelism and Concurrency in Visual Studio 2010 Beta 2
2009-10-23T08:59:22Z 2009-10-23T03:39:08Z tag:,2009:/108.52289 2009-10-23T03:39:08Z
ghillar gastonhillar@hotmail.com Parallel Tasking

Visual Studio 2010 Beta 2 includes many interesting improvements related to its multicore programming features. The parallelism and concurrency profiling tools allow developers to visualize the behavior of a multithreaded application on multicore microprocessors and collect resource contention data.

If you want to translate multicore power into application performance, you have to make sure your concurrent software threads are running on hardware threads taking advantage of parallelism. Visual Studio 2010 Beta 2 improved many profiling reports related to parallelism and concurrency.

The IDE uses the name Concurrency. However, I'd rather talk about both parallelism and concurrency. When you create a multithreaded application, using task-based programming or raw threads, you're creating concurrent code. Nonetheless, it doesn't mean that the concurrent code is going to run in parallel all the time. It depends on the decisions taken by the operating system scheduler, the underlying hardware and the synchronization problems, among others. Therefore, it is necessary to evaluate whether the programmed concurrency is taking advantage of certain parallel hardware capabilities. Are the software threads running in parallel taking advantage of the existing hardware threads? The new Concurrency profiling tools offered by Visual Studio 2010 Beta 2 provide nice information to answer this question. Again, this tool allows you to visualize parallelism and concurrency, not just concurrency.

This option works with Visual Studio 2010 Beta 2 Premium or Ultimate versions. Besides, it requires Windows Vista, Windows 7, Windows Server 2008 or Windows Server 2008 R2.

Before beginning, you must run Visual Studio 2010 Beta 2 as Administrator. Then, you can open the multithreaded solution to analyze and select Analyze, Launch Performance Wizard… from the main menu. I'm going to explain some of the results you get when you activate the options Concurrency (Parallelism and concurrency in my parallel programming language), Collect resource contention data and Visualize the behavior of a multithreaded application, as shown in the following picture:





Specifying the desired profiling method.

If you're working on a 64-bit operating system, you'll probably see a dialog box with this message: "To enable complete call stacks on x64 platforms, executive paging must be disabled. A reboot is then required. To make this change, click "Yes", save your work, and then reboot.", as shown in the following picture:

On 64-bit operating systems, the IDE will disable executive paging and force you to reboot.

You have to take into account that the application is going to take more time to run whilst being profiled. Once the application finishes or the profiling session is interrupted, Visual Studio will start analyzing the generated report.

A minor criticism: the IDE usually takes a long time to analyze the report. It doesn't take advantage of multiple cores to run this CPU-intensive process… I think that multicore programming analysis tools should be optimized for multicore. However, remember that I'm talking about Beta 2. As a multicore developer, I expect multicore development environments to take full advantage of modern multicore microprocessors.

The first graph will show a concurrency visualization, displaying the wall clock time, as shown in the following picture:

Visualizing the behavior of a multithreaded application.

Then, you can click on CPU utilization and Visual Studio will display the average CPU utilization for the analyzed process on a graph, considering the available hardware threads (logical cores). In this case, the average CPU utilization was 86%, as shown in the following picture:

Visualizing the CPU utilization.

However, you have to be careful whilst analyzing this graph. As I explained in my previous post, "TMonitor: Understanding What Happens With Each Hardware Thread", some technologies like Enhanced Intel SpeedStep Technology and Intel Turbo Boost Technology affect the CPU utilization. Besides, a high CPU utilization percentage could mean huge synchronization overheads. Remember to measure speedup and scalability considering the execution time with different hardware threads (logical cores) before profiling.
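Measuring speedup and scalability before profiling can be as simple as the following Python sketch. The timings are hypothetical numbers made up for illustration, not results from the application profiled in this post, and `scalability_report` is an illustrative helper name. Speedup is the single-thread time divided by the time with n hardware threads, and efficiency is the speedup divided by n.

```python
def scalability_report(timings):
    """Return {threads: (speedup, efficiency)} from wall-clock timings.

    `timings` maps a number of hardware threads to the execution time in
    seconds. The entry for 1 thread is the baseline.
    """
    baseline = timings[1]
    report = {}
    for threads, seconds in sorted(timings.items()):
        speedup = baseline / seconds
        efficiency = speedup / threads  # 1.0 would be a perfect linear speedup
        report[threads] = (speedup, efficiency)
    return report


# Hypothetical measurements (seconds), for illustration only.
hypothetical_timings = {1: 40.0, 2: 21.0, 4: 12.0}

report = scalability_report(hypothetical_timings)
for threads, (speedup, efficiency) in report.items():
    print(f"{threads} hardware thread(s): speedup {speedup:.2f}x, "
          f"efficiency {efficiency:.0%}")
```

If the efficiency drops sharply as hardware threads are added, that is a good reason to look at the Threads and Cores views for synchronization or thread creation overhead.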

Then, you can click on Threads and Visual Studio will display visual timelines for disk activity, the main thread and all the worker threads. This is a very useful visualization because it helps you distinguish execution time from synchronization time. Visual Studio uses different colors, as shown in the following visible timeline profile:

Visual Studio uses different colors to fill the timelines and offers a very clear summary.

The following visualization shows the result of running an application that creates groups of worker threads to take advantage of four hardware threads (logical cores). It is not using the work stealing queues offered by .Net 4.0 Beta 2:

Visualizing timelines for each worker thread.

The application uses raw threads. Therefore, it is very easy to see that it is not reusing threads to schedule tasks. It is very important to reduce the thread creation overhead and the existing synchronization to optimize the application. The profiler offers very useful information.
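The difference between creating a new thread for every piece of work and reusing a fixed set of worker threads can be sketched in a few lines. The example below is in Python rather than .Net, and it is not the profiled application; `square`, `run_with_raw_threads` and `run_with_pool` are hypothetical names. It only illustrates thread reuse. In CPython, the global interpreter lock prevents this CPU-bound work from running in parallel, so don't use it to measure speedup.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def square(n):
    """Stand-in for one unit of work; also reports which thread ran it."""
    return n * n, threading.current_thread().name


def run_with_raw_threads(items):
    """Create one brand-new thread per work item (no reuse)."""
    results = [None] * len(items)

    def target(index, item):
        results[index] = square(item)

    threads = [
        threading.Thread(target=target, args=(i, item), name=f"raw-{i}")
        for i, item in enumerate(items)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def run_with_pool(items, workers=4):
    """Reuse at most `workers` threads for all the work items."""
    with ThreadPoolExecutor(max_workers=workers,
                            thread_name_prefix="pooled") as pool:
        return list(pool.map(square, items))


items = list(range(16))
raw = run_with_raw_threads(items)
pooled = run_with_pool(items)

raw_thread_names = {name for _, name in raw}
pooled_thread_names = {name for _, name in pooled}
print(len(raw_thread_names), "threads created without reuse")
print(len(pooled_thread_names), "pooled threads used (at most 4)")
```

The pooled version pays the thread creation cost a few times instead of once per work item, which is the kind of overhead the timelines make visible.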

Finally, you can click on Cores and Visual Studio will display how each software thread was executed on each available hardware thread (logical core). In this case, the application ran on a quad-core CPU with 4 hardware threads (4 logical cores and 4 physical cores), as shown in the following picture:

Visualizing the software threads running on the available hardware threads (logical cores).

Besides, the profiler summarizes the cross-core context switches, the total context switches and the percent of context switches that cross cores.

These new visualization options are really useful for optimizing applications. They help developers using Visual Studio to successfully translate multicore power into application performance. There are many additional options. This is just an introduction to the new views. I'll be adding real-life examples related to parallel programming and profiling using the new features found in Visual Studio 2010 Beta 2.

]]>
Microsoft Visual Studio 2010 and .Net Framework 4.0 Beta 2 Is Out 2009-10-20T02:58:30Z 2009-10-20T02:50:12Z tag:,2009:/108.52193 2009-10-20T02:50:12Z ghillar gastonhillar@hotmail.com Microsoft released the Visual Studio 2010 Beta 2 that comes with the new .Net Framework 4.0 Beta 2. The previous Beta 1 offered many interesting features that empowered parallel programming using the supported languages. However, the IDE had many important performance problems.

]]> Visual Studio 2010 Beta 1 offered a nice IDE. Nonetheless, I was very happy to uninstall this version today. Developers who want to create parallelized code in order to translate multicore power into application performance need a fast IDE. So far, it seems Visual Studio 2010 Beta 2 solves many of the performance issues found in the previous Beta version. Besides, it comes with a go-live license. This means you can start using Visual Studio 2010 Beta 2 for production-related work, subject to the license agreement.

This new release changes the SKU lineup. Now, you'll find the following versions:

Microsoft Visual Studio 2010 Express.

Microsoft Visual Studio 2010 Professional with MSDN. If you're going to create parallelized code with the new Visual Studio, this is the lowest version you should use.

Microsoft Visual Studio 2010 Premium with MSDN. Includes profiling and debugging, advanced database support and UI testing, among other features.

Microsoft Visual Studio 2010 Ultimate with MSDN. This version targets architects, testers and the most demanding developers.

The new names are similar to the different Windows 7 versions.

The new Beta 2 offers new project types. Now, you can target Windows Azure and SharePoint with Visual Studio 2010. Besides, it supports ASP.NET MVC 2 and Silverlight 3.

There are some interesting improvements in the background garbage collection and in the Task Parallel Library that deserve the upgrade from the previous Beta version.

As this is the second Beta release, it's time to test the new performance improvements related to multicore programming. It is also a great opportunity to check whether this new version is prepared to allow developers to take advantage of the newest NUMA architectures, supported by both Windows 7 and Windows Server 2008 R2. The new framework offers many improvements in the Concurrency and Coordination Runtime and it also solves many bugs.

MSDN subscribers can download Visual Studio 2010 and .Net Framework 4.0 Beta 2 today (October 19th). Non-subscribers will have to wait for October 21st, just two days.

If you are an MSDN subscriber, you can log in and download.

If you aren't an MSDN subscriber, check the Visual Studio 2010 and .Net Framework 4.0 Web page and wait until Wednesday.

As things stand, you will have to wait until March 22nd for the final version. It's time to test the new features and improvements found in Beta 2!

]]>
Logical Inferences Per Second (LIPS) vs. Horsepower 2009-10-19T15:59:21Z 2009-10-19T07:46:55Z tag:,2009:/108.52169 2009-10-19T07:46:55Z chughes ctestlabs@ctestlabs.org Multicore Moments Of course we're a little jealous of those developers who get to develop those fun and nifty iPhone apps! Perhaps we're just a tad bit curious too. But for the moment we are absolutely in the grips of a very different kind of software development.

]]> It is a tricky distinction to make when we talk about the computer in terms of solving the problem as opposed to being part of a solution to a problem. I guess in some ways it's all a matter of perspective.

So if we can put it another way, we are currently involved in developing software that uses the process of inference to reach conclusions that, when applied or understood, represent the solution to some (normally nasty) problem or set of problems. This inference process usually takes the form of logical deduction, induction, or abduction. We want to contrast the process of inference with that of computation. So instead of focusing on the work required for a computation we focus on the work required for an inference. In particular we look at how many logical inferences an agent might use to solve some problem. Normally we are concerned with how many inferences an agent might use in the worst case scenario to solve the problem. Once we have some idea of how many inferences that is, we know how big the problem space is and how much computing power we could potentially need. For instance, in our 19 emails project, the worst case scenario for our agent was 19! * 3000 inferences. From a hardware perspective, if I have a quad-core processor and each core is clocked at 1.8 GHz, what are we talking about in terms of LIPS? That is, 1.8 GHz = how many LIPS?

Much like horsepower, the metric of LIPS is from days gone by. The researchers at ICOT and the Fifth Generation Project used it as a way to measure system performance. ICOT had a target of 100 MLIPS (million LIPS) to 1 GLIPS (billion LIPS). At the time, one logical inference took approximately 100 machine instructions. But who knows what machine instructions are these days? (I heard recently that they are similar to lines of Javascript.) Remember, parallelism, even massive parallelism, was at the heart of ICOT and the Fifth Generation project. They produced answers in that project for questions that we are really just now asking today. In that project LIPS was the basic unit of measure for system performance. The ghosts of ICOT have presented me and Tracey with a very seductive argument and we are currently on some DaVinci-Code-Indiana-Jones-Raiders-of-the-Lost-Ark adventure to discover the true path to massive parallelism by looking at the stones that were turned over by ICOT and the stones that weren't.

At one point in time horsepower was a very hands-on metric. The common man knew how many horses it took to plow a plot of land, pull up a tree stump, or get a stage coach to town in a day. At some point, when steam engines were introduced, horsepower was the metric used to compare how much work a steam engine could get done. I'm still a little baffled why 400 horsepower is still used as the metric for my little SUV. I have no idea how many horses it takes to pull up a tree stump or plow a field. I am sure it has meaning for somebody somewhere that has to get their hands dirty. LIPS has a similar function for those of us stuck with the totally futile task of trying to solve AI-complete, or AI-hard problems. Somebody manages to think of our cute little SUV in terms of 400 horsepower and Tracey and I have to think of duo, quad, and eight core * eight hardware thread processors in terms of LIPS.
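To put rough numbers on the "1.8 GHz = how many LIPS?" question, here is a back-of-the-envelope sketch in Python. It uses the figures quoted above (about 100 machine instructions per logical inference, a quad-core processor at 1.8 GHz, and a worst case of 19! * 3000 inferences). The one-instruction-per-cycle figure and the perfect scaling across cores are assumptions made only for this illustration; real processors and real agents will behave differently.

```python
import math

# Figures quoted in the text.
INSTRUCTIONS_PER_INFERENCE = 100   # approximate ICOT-era figure
CLOCK_HZ = 1_800_000_000           # 1.8 GHz
CORES = 4                          # quad-core processor
EMAILS = 19
INFERENCES_PER_ARRANGEMENT = 3000

# Assumption for illustration only: one machine instruction per clock cycle.
INSTRUCTIONS_PER_CYCLE = 1

lips_per_core = CLOCK_HZ * INSTRUCTIONS_PER_CYCLE // INSTRUCTIONS_PER_INFERENCE
total_lips = lips_per_core * CORES  # assumes perfect scaling across cores

worst_case_inferences = math.factorial(EMAILS) * INFERENCES_PER_ARRANGEMENT
worst_case_seconds = worst_case_inferences / total_lips
worst_case_years = worst_case_seconds / (365 * 24 * 3600)

print(f"LIPS per core:         {lips_per_core:,}")
print(f"LIPS for all cores:    {total_lips:,}")
print(f"Worst-case inferences: {worst_case_inferences:,}")
print(f"Worst-case time:       {worst_case_years:,.0f} years")
```

Under these assumptions each core delivers 18 MLIPS, below ICOT's 100 MLIPS target, and the worst case for the 19 emails problem is far beyond what a single notebook can finish in a lifetime.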
There is a very important relationship between LIPS and our ability to cope with massive parallelism. Stay tuned ... ]]> 19 E-mails, How Many Lines of Javascript Per Instruction Does it Take? 2009-11-16T15:52:46Z 2009-10-16T05:29:28Z tag:,2009:/108.52140 2009-10-16T05:29:28Z

chughes ctestlabs@ctestlabs.org Multicore Moments On one hand it's funny. On the other hand, well ... it's funny. It's probably a matter of poetic justice being served up. Something I did or something Tracey did in a past life. But recently we seem unable to escape conversations that end up in questions (which we normally evade) about what we do.]]> Maybe it's just Karma, but somehow we end up in conversations with some happy lad that has just developed some nifty little iPhone app or some web-based, google-earth-enabled-recommender page and he's absolutely excited about it, and we're excited for him.

We think these conversations might be prompted by the fact that when they happen we tend to be sitting somewhere with our notebook computers open and puzzled looks on our faces. The person will unabashedly look at one or both of our screens and then say "I develop software too. That looks like some kind of Javascript you're working on there" and then they proceed to share their enthusiasm about their newly finished app. This has been a recurring theme for the past year or so. Well, it happened again yesterday while we were sitting in the cafe of one of our local booksellers and this time I decided to really share what we were working on. After listening to an almost salesman-quality presentation on how this guy's app is Windows 7 ready and will provide seamless access to the power of cloud computing and all the cores in a computer, he asked what we were working on. He said, "That looks a little like Javascript. Is it server side or client side?"

I said actually it's C++. It's part of some code that implements a rational agent. We're trying to get this agent to help us solve a problem. I then proceeded to explain the problem (well, a version of the problem that wouldn't violate any agreements we had signed). I said we have these 19 emails. The problem is we have no header information for them and no time or date stamp information at all for any of them. These e-mails were originally sent between multiple individuals over a 3 or 4 year period. The order that they were sent in has been lost. The emails contain very sensitive information and that sensitive information is only completely visible if we look at the e-mails in the original order they were sent. I then let him know that we measure our agent's work in how many logical inferences some task takes. In this case our agent does 1 Logical Inference Per Second (LIPS) and it takes approx. 3000 logical inferences to put the e-mails into one potential order. He asked, well, how does LIPS compare to the speed on his 3G iPhone? Not knowing exactly how to respond, I said, well, 1 logical inference is approximately equal to 100 machine instructions. He then said you mean to tell me you need 100 lines of Javascript just to do one logical inference. Again, I wasn't quite sure how to respond. So I just stated what the real problem was. I said pal, we have 19 emails. We don't know what the original order was. We have to find out what the original order was without the benefits of date or time stamp information of any kind. Presently we are using the computer to arrange the e-mails in some arrangement of 19. Then we ask the computer to read the e-mails from start to finish to see whether our sensitive information becomes visible. If it's not visible then the computer comes up with another sequence. It takes the computer about 3000 seconds to do one group of 19. So the lad asked how many groups of 19 does the computer have to look through before finding the information. I said don't know, maybe all of them.
He said how many is that? So I opened up MuPAD, typed in fact(19) and back came the number 121,645,100,408,832,000 (without the commas of course). He said naw. I said yeah. He said naw ... I said trust me. One agent can process 1 arrangement every 3000 seconds. I asked him, if I could dedicate one agent to every core in my notebook, how many cores would I need in my notebook to guarantee the return of the sensitive information in under 30 seconds. At that point I was about to make some snappy remark about checking the answer on his 3G iPhone but I didn't. He looked at me and said "I see you have your work cut out for you", and I said, "Yep and I have no idea how many lines of Javascript per instruction I'm gonna need." ]]>